📅 2023-03-08 - Session: Enhanced web scraping with BeautifulSoup

🕒 19:10–19:30
🏷️ Labels: Web Scraping, BeautifulSoup, Python, HTML Parsing, Data Extraction
📂 Project: Dev
⭐ Priority: MEDIUM

Session Goal

Update a BeautifulSoup-based web scraping script to improve data extraction from HTML pages.

Key Activities

  • Updated CSS selectors in the BeautifulSoup code, replacing the deprecated :contains pseudo-class with :-soup-contains to remove the deprecation warnings.
  • Developed a Python function to scrape researcher and graduate student names into a pandas DataFrame.
  • Corrected a misspelled header tag in the web scraping function to ensure accurate data extraction.
  • Addressed encoding issues by passing the character encoding to BeautifulSoup explicitly instead of relying on automatic detection.
  • Provided guidance on the correct URL for the Image Processing and Computer Vision Group’s webpage.
  • Suggested HTML code corrections to resolve search failures in BeautifulSoup.
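
A minimal sketch of the first two activities combined, not the session's actual script. The page markup is not recorded in these notes, so the sample HTML, the h2 headings, the ul/li layout, the placeholder names, and the function name scrape_names are all assumptions made for illustration.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup: the real page structure is not shown in the notes.
SAMPLE_HTML = """
<html><body>
  <h2>Researchers</h2>
  <ul><li>Alice Example</li><li>Bob Example</li></ul>
  <h2>Graduate Students</h2>
  <ul><li>Carol Example</li></ul>
</body></html>
"""


def scrape_names(html: str) -> pd.DataFrame:
    """Collect names listed under each role heading into a DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for role in ("Researchers", "Graduate Students"):
        # :-soup-contains is the replacement for the deprecated :contains.
        header = soup.select_one(f'h2:-soup-contains("{role}")')
        if header is None:
            continue  # heading missing (or misspelled) on this page
        # Assumes the names sit in the first <ul> following the heading.
        name_list = header.find_next_sibling("ul")
        if name_list is None:
            continue
        for item in name_list.find_all("li"):
            rows.append({"role": role, "name": item.get_text(strip=True)})
    return pd.DataFrame(rows, columns=["role", "name"])


df = scrape_names(SAMPLE_HTML)
print(df)
```

Returning an empty DataFrame when a heading is not found, rather than raising, makes a misspelled header tag show up as missing rows, which is easy to check for when testing against other pages.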

Achievements

  • Updated and corrected the web scraping script: the selectors no longer trigger warnings and the misspelled header tag is fixed.
  • Resolved encoding issues and improved data extraction reliability.
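
The encoding fix could look roughly like the sketch below. The notes do not record which encoding the page used, so ISO-8859-1 and the sample bytes here are assumptions. Note that from_encoding only takes effect when BeautifulSoup receives bytes, not an already decoded string.

```python
from bs4 import BeautifulSoup

# Hypothetical raw response body, encoded as ISO-8859-1 (an assumed encoding).
raw_bytes = "<html><body><p>José Pérez</p></body></html>".encode("iso-8859-1")

# from_encoding tells BeautifulSoup how to decode the bytes instead of guessing.
soup = BeautifulSoup(raw_bytes, "html.parser", from_encoding="iso-8859-1")

name = soup.p.get_text(strip=True)
print(name)
```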

Pending Tasks

  • Further testing of the updated web scraping function with different HTML pages to ensure robustness.
  • Exploration of additional BeautifulSoup features for more complex data extraction scenarios.