πŸ“… 2023-04-16 β€” Session: Implemented robust web scraping with BeautifulSoup

πŸ•’ 20:40–21:05
🏷️ Labels: Python, Beautifulsoup, Web Scraping, Error Handling, Dataframe
πŸ“‚ Project: Dev
⭐ Priority: MEDIUM

Session Goal: The goal of this session was to implement a robust web scraping solution using BeautifulSoup to extract news articles from a specified website and handle potential errors effectively.

Key Activities:

  • Developed a Python script using BeautifulSoup and pandas to scrape news articles and store them in a DataFrame.
  • Implemented error handling to manage missing data for article titles, sections, authors, and dates by assigning None when values are not available.
  • Used try-except blocks to handle cases where article titles or dates may be missing.
  • Integrated conditional checks to avoid errors when accessing attributes in BeautifulSoup.
  • Added a column for β€˜href’ in the DataFrame and set default values when elements are missing.
  • Explored the use of Selenium for scraping dynamically loaded content and discussed legal and ethical considerations.
  • Addressed potential gaierror issues by ensuring valid URLs are used in web scraping.

Achievements:

  • Successfully created a web scraping script with comprehensive error handling and data integration into a pandas DataFrame.
  • Enhanced understanding of handling dynamic content with Selenium.

Pending Tasks:

  • Further exploration of Selenium for dynamic content scraping and legal considerations.
  • Testing the script on different websites to ensure robustness.