📅 2024-03-19 — Session: Developed Web Scraping Techniques for Dynamic Pages

🕒 22:25–23:20
🏷️ Labels: Web Scraping, Beautifulsoup, Python, Dynamic Content, Selenium
📂 Project: Dev
⭐ Priority: MEDIUM

Session Goal

The session aimed to explore and implement web scraping techniques for both static and dynamic web pages, with a focus on navigating HTML structures and handling content loaded by JavaScript.

Key Activities

  • HTML Structure Planning: Outlined a schema for organizing HTML elements for data extraction, working down from a main container through titles, categories, and subcategories to individual products (illustrated in the first sketch after this list).
  • BeautifulSoup and Requests: Demonstrated Python code for fetching and parsing HTML content with requests and BeautifulSoup, noting that this approach only sees server-rendered HTML and misses dynamically loaded content (first sketch below).
  • Error Handling in BeautifulSoup: Provided solutions for KeyError exceptions during scraping, including checking that a tag exists before reading its attributes and correcting f-string usage (second sketch below).
  • CSV Encoding Solutions: Offered guidance on resolving encoding issues when saving CSV files, with emphasis on keeping UTF-8 output readable across different software (third sketch below).
  • Scraping Dynamic Content: Discussed the challenges of scraping AngularJS-rendered pages, recommending Selenium or Puppeteer and checking whether the site exposes an underlying API (fourth sketch below).
  • Precios Claros Repositories: Compared GitHub repositories for scraping the 'Precios Claros' website, evaluating their technical approaches and suitability for different users.
  • OpenDataCordoba Guide: Detailed steps for using the OpenDataCordoba repository to scrape data from Precios Claros, including setup and execution.
  • Debugging with ipdb: Provided instructions for using the ipdb debugger in Python, including common commands and breakpoint management (fifth sketch below).
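
The sketch below combines the planned hierarchy with the requests/BeautifulSoup approach. The URL and every class name (main-container, category, subcategory, product) are placeholders invented for illustration rather than the markup of any real site; the points it is meant to show are the nested container-to-product extraction loop and the fact that requests only receives server-rendered HTML.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL and class names, used only to illustrate the planned
# container -> title -> category -> subcategory -> product hierarchy.
URL = "https://example.com/catalog"


def text_of(tag, name):
    """Stripped text of the first child tag with this name, or '' if absent."""
    found = tag.find(name)
    return found.get_text(strip=True) if found else ""


def scrape_static(url):
    # requests only receives the server-rendered HTML; anything injected
    # later by JavaScript (AngularJS and similar) is not in response.text.
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    rows = []
    container = soup.find("div", class_="main-container")
    if container is None:
        return rows
    for category in container.find_all("div", class_="category"):
        for sub in category.find_all("div", class_="subcategory"):
            for product in sub.find_all("div", class_="product"):
                rows.append({
                    "title": text_of(soup, "h1"),
                    "category": text_of(category, "h2"),
                    "subcategory": text_of(sub, "h3"),
                    "product": product.get_text(strip=True),
                })
    return rows


if __name__ == "__main__":
    for row in scrape_static(URL):
        print(row)
```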
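
A minimal illustration of the KeyError handling discussed above, using a toy HTML fragment instead of the session's actual page. The idea is to check that a tag exists before touching it and to prefer tag.get() with a default over tag[...]; the exact f-string bug fixed during the session is not recorded, so the final comment only points out a common pitfall.

```python
from bs4 import BeautifulSoup

# Toy fragment: the <a> tag has no href and the <img> has no src, which is
# exactly what makes tag["href"] or tag["src"] raise KeyError.
html = '<div class="product"><a class="name">Milk</a><img alt="milk"></div>'
soup = BeautifulSoup(html, "html.parser")
product = soup.find("div", class_="product")

# Check that the tag exists, then read attributes with .get() and a default
# so a missing attribute yields "N/A" instead of an exception.
link = product.find("a")
href = link.get("href", "N/A") if link is not None else "N/A"

img = product.find("img")
src = img.get("src", "N/A") if img is not None else "N/A"

# A frequent f-string pitfall is reusing the outer quote character inside
# the braces (a syntax error before Python 3.12); mixing quote styles, as
# with attrs['href'] inside a double-quoted string, avoids it.
attrs = {"href": href, "src": src}
print(f"name={product.get_text(strip=True)}, href={attrs['href']}, src={attrs['src']}")
```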
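
For the CSV encoding issue, a commonly used remedy is the utf-8-sig codec, which prepends a byte-order mark so spreadsheet programs that guess the encoding (notably Excel) recognise the file as UTF-8. The sample rows below are invented for illustration.

```python
import csv

# Invented sample rows containing non-ASCII characters, the case that
# usually triggers the encoding problem.
rows = [
    {"product": "Café molido", "price": "1250.50"},
    {"product": "Azúcar", "price": "980.00"},
]

# "utf-8-sig" writes a BOM so software that guesses the encoding detects
# UTF-8; plain "utf-8" is enough for most other tools. newline="" prevents
# the extra blank rows the csv module otherwise produces on Windows.
with open("products.csv", "w", encoding="utf-8-sig", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=["product", "price"])
    writer.writeheader()
    writer.writerows(rows)
```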
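
For AngularJS-rendered pages, a Selenium sketch along the following lines drives a real browser and waits for the JavaScript-rendered elements before reading them. The URL and the div.product selector are placeholders, and the snippet assumes Selenium 4, which locates a matching chromedriver automatically; Puppeteer or a direct call to an underlying API, where one exists, are the alternatives noted above.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

URL = "https://example.com/angular-app"  # placeholder URL

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)
try:
    driver.get(URL)
    # Wait until the JavaScript-rendered products are actually in the DOM,
    # instead of parsing the initial, mostly empty HTML shell.
    WebDriverWait(driver, 15).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.product"))
    )
    for element in driver.find_elements(By.CSS_SELECTOR, "div.product"):
        print(element.text)
finally:
    driver.quit()
```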
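
Finally, a minimal ipdb sketch (pip install ipdb); parse_price is a made-up function that only exists to give the breakpoint somewhere to stop.

```python
import ipdb


def parse_price(raw):
    # Execution pauses here and drops into the ipdb prompt. Common commands:
    # n (next line), s (step into), c (continue), p <expr> (print a value),
    # b <lineno> (set a breakpoint), l (list source), q (quit).
    ipdb.set_trace()
    cleaned = raw.replace("$", "").replace(",", ".")
    return float(cleaned)


if __name__ == "__main__":
    print(parse_price("$129,99"))
```

On Python 3.7+, the built-in breakpoint() call can be routed to ipdb by setting PYTHONBREAKPOINT=ipdb.set_trace, which avoids leaving hard-coded imports in the code.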

Achievements

  • Successfully outlined and implemented strategies for scraping both static and dynamic web pages.
  • Addressed common scraping errors, such as missing attributes and CSV encoding problems, and documented working solutions.
  • Evaluated and selected tools and repositories for specific scraping needs.

Pending Tasks

  • Further exploration of API availability for dynamic content extraction.
  • Implementation of automated scraping scripts using Selenium or Puppeteer for JavaScript-heavy pages.