📅 2023-04-16 — Session: Developed and Debugged Web Scraping Scripts for News Extraction
🕒 19:35–20:15
🏷️ Labels: Web Scraping, Python, BeautifulSoup, GitHub, Regular Expressions
📂 Project: Dev
⭐ Priority: MEDIUM
Session Goal
The session aimed to develop and debug Python scripts for web scraping news sources, focusing on extracting names, URLs, and logos of news stations, as well as analyzing HTML page structures.
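As a rough illustration of this goal, the sketch below pulls a name, URL, and logo for each station out of a small HTML fragment with BeautifulSoup. The markup, the `station` class, and the `extract_stations` helper are assumptions made for the example; the pages actually scraped in the session are not recorded here and will be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; a real news directory page will differ.
SAMPLE_HTML = """
<ul>
  <li class="station">
    <a href="https://example.com/news"><img src="/logos/example.png" alt="Example News"></a>
  </li>
  <li class="station">
    <a href="https://example.org">Sample Daily</a>
  </li>
</ul>
"""


def extract_stations(html):
    """Return one dict per station with its name, URL, and logo (or None)."""
    soup = BeautifulSoup(html, "html.parser")
    stations = []
    for item in soup.find_all("li", class_="station"):
        link = item.find("a", href=True)
        if link is None:
            continue  # skip entries without a usable link
        logo = link.find("img")
        # Prefer the visible link text; fall back to the logo's alt text.
        name = link.get_text(strip=True) or (logo.get("alt", "") if logo else "")
        stations.append({
            "name": name,
            "url": link["href"],
            "logo": logo.get("src") if logo else None,
        })
    return stations


for station in extract_stations(SAMPLE_HTML):
    print(station)
```

Returning `None` for a missing logo, rather than raising, is one way to avoid the kind of logo-extraction errors that come from assuming every entry has an `<img>` tag.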
Key Activities
- Troubleshooting GitHub Push Authentication: Addressed GitHub push authentication failures by checking the Git credential setup and reviewing the error messages.
- Automating Keyword Search: Explored methods for automating keyword searches in news articles using web scraping, NLP, and machine learning.
- Python Web Scraper Development: Implemented Python scripts using BeautifulSoup and pandas to scrape news sources and extract relevant data.
- Error Fixes: Resolved errors related to HTML parsing and logo extraction in web scraping scripts.
- Domain Name Extraction: Used regular expressions to extract domain names from URLs.
- Regular Expressions Insight: Reviewed how capturing and non-capturing groups work in regular expressions.
- Web Crawler Overview: Discussed ethical considerations and libraries for web crawling.
- HTML Structure Analysis: Analyzed common structural elements in HTML pages.
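The domain-extraction and grouping items above can be sketched together. The pattern and the `extract_domain` helper below are illustrative assumptions, not the exact expression used in the session: the scheme and the `www.` prefix sit in non-capturing groups `(?:...)`, so the only capturing group `(...)` is the domain itself.

```python
import re

# (?:https?://)? and (?:www\.)? are non-capturing: they match but are not stored.
# ([^/:?#]+) is the single capturing group, so the domain is always group(1).
DOMAIN_RE = re.compile(r"^(?:https?://)?(?:www\.)?([^/:?#]+)", re.IGNORECASE)


def extract_domain(url):
    """Return the lowercased host part of a URL, or None if nothing matches."""
    match = DOMAIN_RE.match(url.strip())
    return match.group(1).lower() if match else None


print(extract_domain("https://www.example.com/news/today"))
print(extract_domain("http://news.example.org:8080/path"))
```

Had the optional prefixes been written as plain capturing groups, the domain would move to `group(3)` and the earlier groups could be `None`. For production use, `urllib.parse.urlsplit` from the standard library is likely a more robust choice than a hand-written pattern.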
Achievements
- Successfully developed and debugged multiple Python scripts for web scraping tasks.
- Enhanced understanding of regular expressions and ethical web scraping practices.
- Gained insights into HTML structure and metadata roles.
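To make the metadata point concrete, here is a minimal sketch that reads the `<title>` and `<meta>` tags of a page. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra dependencies; the sample page, the `HeadMetadataParser` class, and the `parse_head` helper are assumptions for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page; real news sites expose different metadata.
SAMPLE_PAGE = """<html><head>
<title>Example News</title>
<meta name="description" content="Local headlines">
<meta property="og:image" content="https://example.com/logo.png">
</head><body><h1>Top stories</h1></body></html>"""


class HeadMetadataParser(HTMLParser):
    """Collect the page title and name/property -> content pairs from meta tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Standard tags use name=..., Open Graph tags use property=...
            key = attrs.get("name") or attrs.get("property")
            if key and attrs.get("content") is not None:
                self.meta[key.lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_head(html):
    parser = HeadMetadataParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.meta


title, meta = parse_head(SAMPLE_PAGE)
print(title)
print(meta)
```

Metadata such as `og:image` is often a more stable source for a site's logo than an `<img>` tag buried in the page body, though that depends on the site.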
Pending Tasks
- Further automate keyword search processes using advanced NLP and machine learning techniques.
- Explore additional ethical considerations and legal aspects of web scraping.