📅 2023-03-08 - Session: Enhanced Web Scraping with BeautifulSoup

🕒 19:10–19:30
🏷️ Labels: Web Scraping, BeautifulSoup, Python, HTML Parsing
📂 Project: Dev
⭐ Priority: MEDIUM

Session Goal:

The goal of this session was to refine and enhance web scraping capabilities using BeautifulSoup in Python, focusing on extracting researcher data from HTML pages.

Key Activities:

  • Updated the CSS selector in the BeautifulSoup code to use :-soup-contains instead of the deprecated :contains, which removes the deprecation warning.
  • Developed a Python function to scrape researcher and graduate student names, returning the data in a pandas DataFrame.
  • Corrected a misspelled header tag from ‘Reserchers’ to ‘Researchers’ in the web scraping function.
  • Addressed encoding issues by specifying character encoding manually in BeautifulSoup to ensure proper data parsing.
  • Provided guidance on the correct URL for the Image Processing and Computer Vision Group’s webpage.
  • Suggested a workaround for a misspelling in the HTML code that affected data extraction.
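
A minimal sketch of how the selector change, the DataFrame output, and the misspelling workaround could fit together. The sample markup, the h2/ul layout, the column names, and the function names (names_under, scrape_people) are assumptions for illustration; the real page structure and the session's actual function are not recorded in this log.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup: stands in for the real page, which is not shown in the log.
SAMPLE_HTML = """
<h2>Reserchers</h2>
<ul><li>Ada Example</li><li>Bo Sample</li></ul>
<h2>Graduate Students</h2>
<ul><li>Cy Placeholder</li></ul>
"""


def names_under(soup, *spellings):
    """Return the <li> texts of the first <ul> after a matching <h2>."""
    # :-soup-contains is the non-deprecated replacement for :contains.
    # A comma-separated selector list matches any of the given spellings.
    selector = ", ".join(f'h2:-soup-contains("{s}")' for s in spellings)
    header = soup.select_one(selector)
    if header is None:
        return []
    listing = header.find_next("ul")
    if listing is None:
        return []
    return [li.get_text(strip=True) for li in listing.find_all("li")]


def scrape_people(html):
    """Collect researcher and graduate student names into a DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # Accept both spellings so a typo in the page does not break extraction.
    for name in names_under(soup, "Researchers", "Reserchers"):
        rows.append({"name": name, "role": "Researcher"})
    for name in names_under(soup, "Graduate Students"):
        rows.append({"name": name, "role": "Graduate Student"})
    return pd.DataFrame(rows, columns=["name", "role"])


df = scrape_people(SAMPLE_HTML)
```

Matching on a list of spellings keeps the scraper working whether or not the page's typo is eventually fixed upstream.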

Achievements:

  • Successfully updated and corrected web scraping scripts to handle CSS selector warnings, encoding issues, and HTML misspellings.
  • Improved data extraction accuracy and reliability for researcher information.
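
The encoding fix can be sketched as below, using BeautifulSoup's from_encoding argument on raw bytes. The ISO-8859-1 encoding and the sample bytes are assumptions; the log does not say which encoding the page actually used.

```python
from bs4 import BeautifulSoup

# Hypothetical bytes standing in for a response body (e.g. response.content).
raw = "<html><body><h2>Researchers</h2><p>José Ñandú</p></body></html>".encode(
    "iso-8859-1"
)

# from_encoding tells BeautifulSoup how to decode the bytes instead of guessing.
# It only applies to bytes input; it is ignored for already-decoded strings.
soup = BeautifulSoup(raw, "html.parser", from_encoding="iso-8859-1")
name = soup.p.get_text()
```

Passing the undecoded bytes matters here: if the response has already been decoded with the wrong codec, the damage is done before BeautifulSoup sees the text.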

Pending Tasks:

  • Further testing of the updated scripts on different HTML pages to ensure robustness.
  • Verification of the correct URL for all relevant web pages to prevent future errors.