📅 2023-11-07 — Session: Implemented Web Scraping for Academic Data Extraction
🕒 22:05–23:40
🏷️ Labels: Web Scraping, Python, BeautifulSoup, Selenium, Google Scholar
📂 Project: Dev
⭐ Priority: MEDIUM
Session Goal
Implement and troubleshoot web scraping methods for extracting academic paper information from Google Scholar and other academic databases.
Key Activities
- Installed a .deb package on a Debian-based system and resolved its dependencies.
- Troubleshot the JabRef command not being recognized and set up the executable path.
- Explored various tools for gathering citation data, including Google Scholar, Web of Science, and Zotero.
- Developed Python scripts for web scraping, focusing on HTML parsing using BeautifulSoup and handling pagination.
- Addressed legal considerations and technical challenges in scraping Google Scholar.
- Troubleshot Selenium WebDriver issues, including ChromeDriver version mismatches and WebDriver options errors.
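
The parsing and pagination approach from this session can be sketched roughly as follows. This is a minimal illustration, not the script written during the session: the markup, the class names (`result`, `title`, `meta`), and the helpers `fetch_page`, `parse_results`, and `scrape_all` are all hypothetical, and `fetch_page` returns canned HTML instead of making a network request so the sketch runs offline.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the class names "result", "title", and "meta" are
# placeholders, not the selectors of any real site.
SAMPLE_PAGES = {
    0: """
    <div class="result"><h3 class="title"><a href="/p/1">Paper One</a></h3>
      <div class="meta">A. Author - Journal A, 2020</div></div>
    <div class="result"><h3 class="title"><a href="/p/2">Paper Two</a></h3>
      <div class="meta">B. Author - Journal B, 2021</div></div>
    """,
    2: """
    <div class="result"><h3 class="title"><a href="/p/3">Paper Three</a></h3>
      <div class="meta">C. Author - Journal C, 2022</div></div>
    """,
}


def fetch_page(start):
    """Stand-in for an HTTP request: canned HTML, or "" past the last page."""
    return SAMPLE_PAGES.get(start, "")


def parse_results(html):
    """Turn one page of result blocks into a list of dicts."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for block in soup.find_all("div", class_="result"):
        heading = block.find("h3", class_="title")
        link = heading.find("a") if heading else None
        if link is None:
            continue  # skip blocks that lack a title link
        meta = block.find("div", class_="meta")
        records.append({
            "title": link.get_text(strip=True),
            "url": link.get("href"),
            "meta": meta.get_text(strip=True) if meta else "",
        })
    return records


def scrape_all(page_size=2, max_pages=10):
    """Walk pages by offset until a page yields no results."""
    results = []
    for page in range(max_pages):
        records = parse_results(fetch_page(page * page_size))
        if not records:
            break
        results.extend(records)
    return results
```

Calling `scrape_all()` on the canned pages returns three records. Against a live site, `fetch_page` would need a real HTTP client, request throttling, and a check of the site's terms of service.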
Achievements
- Successfully installed and configured JabRef on a Linux system.
- Developed and refined Python scripts for extracting and structuring data from HTML content.
- Worked through common web scraping challenges, including dynamically loaded content and ChromeDriver version mismatches in Selenium.
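
A ChromeDriver version mismatch usually means the driver's major version differs from the installed browser's. One way to catch this before launching Selenium is to compare the version strings that each binary prints. The sketch below assumes those strings have already been captured (for example from each tool's `--version` output); the function names and the sample strings in the usage note are illustrative only.

```python
import re


def major_version(version_output):
    """Return the major component of the first dotted version number found."""
    match = re.search(r"(\d+)(?:\.\d+)+", version_output)
    if match is None:
        raise ValueError(f"no version number in {version_output!r}")
    return int(match.group(1))


def versions_match(chrome_output, driver_output):
    """True when the browser and driver report the same major version."""
    return major_version(chrome_output) == major_version(driver_output)
```

For example, `versions_match("Google Chrome 119.0.6045.105", "ChromeDriver 118.0.5993.70 (abc)")` evaluates to `False`, which would indicate that the driver needs updating.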
Pending Tasks
- Further refine regex patterns and parsing logic for more accurate data extraction.
- Explore alternative methods for accessing Google Scholar data legally and efficiently.
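
As a starting point for the regex refinement, here is a hedged sketch of a pattern for metadata lines shaped like `authors - venue, year - source`. That line format is an assumption about what the scraped text looks like, and `META_PATTERN` and `parse_meta` are hypothetical names, not taken from the session's scripts.

```python
import re

# Assumed line shape: "<authors> - <venue>, <year>[ - <source>]"
META_PATTERN = re.compile(
    r"^(?P<authors>.+?)\s+-\s+"          # authors, up to the first " - "
    r"(?P<venue>.+?),\s*"                # venue, up to the comma before the year
    r"(?P<year>(?:19|20)\d{2})"          # four-digit year in 1900-2099
    r"(?:\s+-\s+(?P<source>.+))?$"       # optional trailing source
)


def parse_meta(line):
    """Split one metadata line into fields, or return None if it does not fit."""
    match = META_PATTERN.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["authors"] = [name.strip() for name in fields["authors"].split(",")]
    fields["year"] = int(fields["year"])
    return fields
```

Returning `None` for lines that do not fit lets the caller log and inspect the failures, which helps decide where the pattern needs loosening.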