Enhanced Web Scraping Scripts for Student Data

  • Day: 2024-08-01
  • Time: 22:30 to 23:55
  • Project: Dev
  • Workspace: WP 2: Operational
  • Status: Completed
  • Priority: MEDIUM
  • Assignee: Matías Nehuen Iglesias
  • Tags: Python, Selenium, Web Scraping, Data Extraction, BeautifulSoup

Description

Session Goal

The goal of this session was to develop and refine Python scripts for web scraping student data using Selenium and BeautifulSoup.

Key Activities

  • Developed a Python script utilizing Selenium and BeautifulSoup to extract student information from web pages, storing data in pandas DataFrames while avoiding duplicates based on URL IDs.
  • Modified Selenium scripts to manage browser sessions and tabs effectively, enhancing error handling to improve script robustness.
  • Handled empty tables and resolved pandas deprecation warnings by replacing the deprecated DataFrame.append with pd.concat for DataFrame concatenation.
  • Updated scripts to print the parsed HTML structure with BeautifulSoup’s prettify method and added error handling to confirm that pages load correctly.
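
The deduplication, empty-table handling, and pd.concat steps listed above could look roughly like the sketch below. It is not the session's actual script: the "?id=" URL pattern, the function names (extract_id, parse_table, collect), and the example.test URLs are assumptions made for illustration, since the log does not record the real page layout.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical URL pattern: the real site's URL format is not given in this log.
ID_PATTERN = re.compile(r"[?&]id=(\d+)")


def extract_id(url):
    """Return the numeric ID embedded in the URL, or None if there is none."""
    match = ID_PATTERN.search(url)
    return match.group(1) if match else None


def parse_table(html, url_id):
    """Parse the first <table> in the page into a DataFrame (empty if no data rows)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return pd.DataFrame()
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # header-only rows (<th>) and blank rows yield no cells
            rows.append(cells)
    if not rows:
        return pd.DataFrame()
    df = pd.DataFrame(rows)
    df.insert(0, "url_id", url_id)  # keep the source ID next to each row
    return df


def collect(pages):
    """Combine (url, html) pairs into one DataFrame, skipping repeated URL IDs."""
    seen = set()
    frames = []
    for url, html in pages:
        url_id = extract_id(url)
        if url_id is None or url_id in seen:
            continue  # no usable ID, or this ID was already scraped
        seen.add(url_id)
        df = parse_table(html, url_id)
        if not df.empty:  # empty tables are skipped rather than concatenated
            frames.append(df)
    if not frames:
        return pd.DataFrame()
    # One pd.concat over a list replaces repeated DataFrame.append calls.
    return pd.concat(frames, ignore_index=True)
```

Collecting the per-page frames in a list and concatenating once avoids copying the growing DataFrame on every page, which is the usual reason to prefer pd.concat here. For debugging, print(soup.prettify()) inside parse_table shows the parsed HTML structure.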

Achievements

  • Successfully created and refined multiple scripts for extracting and processing student data from web pages.
  • Improved error handling and session management in Selenium scripts, increasing the stability and reliability of the scraping process.
  • Optimized data handling in pandas by replacing DataFrame.append with pd.concat and skipping empty tables before concatenation.
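
As an illustration of the session management and error handling mentioned above, the following is a minimal sketch under stated assumptions, not the code written in this session. The helper names (browser_session, get_with_retry, scrape_with_chrome), the retry count, and the delay are hypothetical; only webdriver.Chrome, driver.get, driver.page_source, driver.quit, and WebDriverException are real Selenium names.

```python
import time
from contextlib import contextmanager


@contextmanager
def browser_session(make_driver):
    """Create a driver and guarantee it is quit, even if scraping fails."""
    driver = make_driver()
    try:
        yield driver
    finally:
        driver.quit()  # closes every window and tab and ends the session


def get_with_retry(driver, url, attempts=3, delay=1.0, retry_on=(Exception,)):
    """Load a URL and return its page source, retrying on the given errors."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            driver.get(url)
            return driver.page_source
        except retry_on as exc:
            last_error = exc
            if attempt < attempts:
                time.sleep(delay)  # brief pause before the next attempt
    raise RuntimeError(f"could not load {url} after {attempts} attempts") from last_error


def scrape_with_chrome(urls):
    """Hypothetical entry point showing the helpers with a real Chrome driver."""
    # Imported here so the helpers above stay usable without Selenium installed.
    from selenium import webdriver
    from selenium.common.exceptions import WebDriverException

    pages = []
    with browser_session(webdriver.Chrome) as driver:
        for url in urls:
            html = get_with_retry(driver, url, retry_on=(WebDriverException,))
            pages.append((url, html))
    return pages
```

Wrapping the driver in a context manager means a failed page load cannot leave a browser process behind, and narrowing retry_on to WebDriverException keeps unrelated bugs from being silently retried.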

Pending Tasks

  • Further testing of scripts in diverse web environments to ensure robustness across different scenarios.
  • Continuous monitoring and adjustment of scripts to accommodate any changes in web page structures or technologies.

Evidence

  • source_file=2024-08-01.sessions.jsonl, line_number=2, event_count=0, session_id=995a8361e14cc97fa3e3fa67e103518dc2f5414272b73f860e46f793ea471eca
  • event_ids: []