2023-03-08 - Session: Enhanced Web Scraping with BeautifulSoup
19:10-19:30
Labels: Web Scraping, BeautifulSoup, Python, HTML Parsing
Project: Dev
Priority: MEDIUM
Session Goal:
The goal of this session was to refine and enhance web scraping capabilities using BeautifulSoup in Python, focusing on extracting researcher data from HTML pages.
Key Activities:
- Updated the CSS selector in the BeautifulSoup code to use :-soup-contains instead of :contains to avoid warnings.
- Developed a Python function to scrape researcher and graduate student names, returning the data in a pandas DataFrame.
- Corrected a misspelled header tag from "Reserchers" to "Researchers" in the web scraping function.
- Addressed encoding issues by specifying character encoding manually in BeautifulSoup to ensure proper data parsing.
- Provided guidance on the correct URL for the Image Processing and Computer Vision Group's webpage.
- Suggested a workaround for a misspelling in the HTML code that affected data extraction.
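A minimal sketch of how the activities above might fit together. The sample markup, the h2/ul page layout, the function name scrape_people, and the column names are all assumptions made for illustration; the actual page structure and the session's real function are not recorded in this log.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical sample markup; the real page layout is not recorded in this log.
# The first heading keeps the "Reserchers" typo to mimic a misspelling in the HTML.
SAMPLE_HTML = """
<html><body>
  <h2>Reserchers</h2>
  <ul><li>Ada Example</li><li>Grace Sample</li></ul>
  <h2>Graduate Students</h2>
  <ul><li>José Demo</li></ul>
</body></html>
""".encode("utf-8")


def scrape_people(html_bytes, encoding="utf-8"):
    """Return a DataFrame with one row per person and the role they are listed under."""
    # from_encoding tells BeautifulSoup how to decode the raw bytes instead of guessing.
    soup = BeautifulSoup(html_bytes, "html.parser", from_encoding=encoding)

    # Heading spellings to try per role; the misspelled form is a fallback.
    headings = {
        "Researcher": ["Researchers", "Reserchers"],
        "Graduate Student": ["Graduate Students"],
    }

    rows = []
    for role, spellings in headings.items():
        for text in spellings:
            # :-soup-contains replaces the deprecated :contains pseudo-class.
            header = soup.select_one(f'h2:-soup-contains("{text}")')
            if header is None:
                continue
            # Assumption: names sit in the first <ul> that follows the heading.
            name_list = header.find_next("ul")
            if name_list is not None:
                for item in name_list.find_all("li"):
                    rows.append({"name": item.get_text(strip=True), "role": role})
            break  # stop at the first spelling that matched
    return pd.DataFrame(rows, columns=["name", "role"])


df = scrape_people(SAMPLE_HTML)
print(df)
```

Listing the misspelled heading as a fallback keeps the scraper working whether or not the page is later corrected. For a live page, the bytes would typically come from something like requests.get(url).content, with the encoding chosen to match what the server actually sends.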
Achievements:
- Successfully updated and corrected web scraping scripts to handle CSS selector warnings, encoding issues, and HTML misspellings.
- Improved data extraction accuracy and reliability for researcher information.
Pending Tasks:
- Further testing of the updated scripts on different HTML pages to ensure robustness.
- Verification of the correct URL for all relevant web pages to prevent future errors.