📅 2024-07-11 — Session: Developed robust web scraping and NLP workflows

🕒 20:00–20:45
🏷️ Labels: Web Scraping, NLP, Error Handling, Entity Recognition, BERT
📂 Project: Dev
⭐ Priority: MEDIUM

Session Goal: Improve the web scraping and natural language processing (NLP) workflows, focusing on robustness, error handling, and entity recognition accuracy.

Key Activities:

  • Developed an end-to-end workflow for web scraping and NLP using BeautifulSoup, Scrapy, spaCy, and pandas.
  • Added error handling to the web scraping functions to catch SSL and HTTP exceptions.
  • Updated the clean_html function to strip unnecessary whitespace from the extracted text.
  • Enhanced entity recognition accuracy through preprocessing, custom model training, and post-processing techniques.
  • Explored the use of spaCy’s Spanish models for NLP tasks, including entity extraction.
  • Developed a structured approach for text classification using machine learning, specifically with scikit-learn and BERT models.
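
The error handling and whitespace cleanup listed above could look roughly like the sketch below. It is an illustration under assumptions, not the code written in the session: the log does not say which HTTP client was used, so this version relies on the standard library's urllib, and the fetch_html name, its timeout and opener parameters, and the body of clean_html are all hypothetical. The session's clean_html presumably builds on BeautifulSoup; regular expressions stand in for it here to keep the example dependency-free.

```python
import logging
import re
import ssl
import urllib.error
import urllib.request
from html import unescape

logger = logging.getLogger(__name__)


def fetch_html(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return the page body as text, or None if the request fails.

    Hypothetical helper: `opener` is injectable so the function can be
    exercised without network access.
    """
    try:
        with opener(url, timeout=timeout) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except urllib.error.HTTPError as exc:
        # HTTPError subclasses URLError, so it has to be caught first.
        logger.warning("HTTP error %s for %s", exc.code, url)
    except urllib.error.URLError as exc:
        # urllib usually wraps certificate/handshake failures in URLError.
        kind = "SSL error" if isinstance(exc.reason, ssl.SSLError) else "URL error"
        logger.warning("%s for %s: %s", kind, url, exc.reason)
    except ssl.SSLError as exc:
        # SSL failures raised outside the URLError wrapper, e.g. while reading.
        logger.warning("SSL error for %s: %s", url, exc)
    return None


def clean_html(raw_html):
    """Strip tags and collapse whitespace (assumed behaviour of clean_html)."""
    # Drop script/style bodies so their contents do not leak into the text.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", raw_html)
    # Remove the remaining tags, then decode entities such as &amp; or &nbsp;.
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    text = unescape(text)
    # Collapse runs of whitespace, including newlines and non-breaking spaces.
    return re.sub(r"\s+", " ", text).strip()
```

A caller would typically chain the two, skipping pages that failed to download: check that fetch_html(url) returned something other than None before passing it to clean_html. Returning None instead of re-raising is one possible design choice; it keeps a batch scrape running past a bad URL at the cost of pushing the failure check onto the caller.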

Achievements:

  • Created web scraping scripts with error handling for SSL and HTTP exceptions.
  • Improved data cleaning processes and entity recognition workflows.
  • Established a foundation for text classification with scikit-learn and BERT models.

Pending Tasks:

  • Further fine-tuning of the BERT-based text classification model for improved performance.
  • Integration of the enhanced NLP workflows into existing systems for real-world application.