📅 2024-07-11 — Session: Developed and Enhanced Web Scraping and NLP Workflows
🕒 20:00–20:45
🏷️ Labels: Web Scraping, NLP, Entity Recognition, Python, SSL Handling
📂 Project: Dev
⭐ Priority: MEDIUM
Session Goal
The primary goal of this session was to develop and enhance workflows for web scraping and natural language processing (NLP), focusing on robust data extraction and improved entity recognition.
Key Activities
- Web Scraping Workflow: Designed comprehensive workflows for scraping webpages, extracting key information, and recognizing entities using tools like BeautifulSoup, Scrapy, spaCy, and Pandas.
- Data Processing: Implemented structured workflows for processing URLs, extracting data, and storing it in structured formats using SQLite and Pandas.
- Error Handling: Enhanced the `scrape_page` function to handle SSL errors and other request exceptions, improving the robustness of the scraping process.
- HTML Cleaning: Updated the `clean_html` function to remove multiple spaces, tabs, and newlines, ensuring clean text output.
- Entity Recognition: Improved entity recognition accuracy through preprocessing, custom model training, and post-processing techniques.
- Spanish NLP: Utilized spaCy with pre-trained Spanish models for entity extraction tasks.
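The error-handling and HTML-cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the session's actual code: it uses only the standard library (`urllib` and `html.parser`) in place of whatever HTTP client and HTML parser the session used, and the signatures of `scrape_page` and `clean_html`, the `timeout` default, and the injectable `opener` parameter are all assumptions made for the example.

```python
import logging
import re
import urllib.error
import urllib.request
from html.parser import HTMLParser
import ssl

logger = logging.getLogger(__name__)


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def clean_html(html):
    """Strip tags, then collapse runs of spaces, tabs, and newlines."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with a space also separates adjacent inline elements.
    text = " ".join(parser.parts)
    return re.sub(r"\s+", " ", text).strip()


def scrape_page(url, timeout=10, opener=urllib.request.urlopen):
    """Return the page body as text, or None if the request fails.

    `opener` is a hypothetical hook so the function can be exercised
    without network access; by default it is urllib's urlopen.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:
        # HTTPError is a subclass of URLError, so it must be caught first.
        logger.warning("HTTP error %s for %s", exc.code, url)
    except urllib.error.URLError as exc:
        # urlopen wraps connection-level failures, including SSL ones.
        if isinstance(exc.reason, ssl.SSLError):
            logger.warning("SSL error for %s: %s", url, exc.reason)
        else:
            logger.warning("Request failed for %s: %s", url, exc.reason)
    except (OSError, ValueError) as exc:
        # Errors raised while reading the body, or a malformed URL.
        logger.warning("Could not fetch %s: %s", url, exc)
    return None
```

A caller would typically chain the two, for example `text = clean_html(page)` after checking that `page = scrape_page(url)` is not `None`, so that a failed request is skipped rather than aborting the whole run.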
Achievements
- Developed robust web scraping scripts with SSL and HTTP error handling.
- Enhanced entity recognition accuracy and data cleaning processes.
- Implemented workflows for text classification using machine learning models, including BERT and scikit-learn.
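As a rough illustration of the scikit-learn side of the text classification workflow, the sketch below trains a small pipeline. The session notes do not say which vectorizer or classifier was used, so the TF-IDF plus logistic regression combination is an assumption, and the training sentences and the `sports`/`tech` labels are invented toy data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy data, for illustration only.
train_texts = [
    "the team won the football match",
    "the striker scored a late goal",
    "the coach praised the players after the game",
    "the new laptop has a faster processor",
    "the software update fixes a security bug",
    "developers released a new programming library",
]
train_labels = ["sports", "sports", "sports", "tech", "tech", "tech"]

# TF-IDF features feeding a linear classifier (an assumed model choice).
classifier = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(),
)
classifier.fit(train_texts, train_labels)

predictions = classifier.predict(
    ["the striker scored in the match", "a software bug in the library"]
)
```

Keeping the vectorizer and the model in one pipeline means the same preprocessing is applied at training and prediction time, which matters when the input is freshly scraped text.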
Pending Tasks
- Further testing and validation of the enhanced workflows and models in different environments.
- Exploration of additional NLP models and techniques for improved accuracy.