📅 2023-11-07 — Session: Implemented Web Scraping for Academic Data Extraction

🕒 22:05–23:40
🏷️ Labels: Web Scraping, Python, BeautifulSoup, Selenium, Google Scholar
📂 Project: Dev
⭐ Priority: MEDIUM

Session Goal

The session aimed to implement and troubleshoot web scraping methods for extracting academic paper information from Google Scholar and other academic databases.

Key Activities

  • Installed a .deb package on a Debian-based system and resolved its dependencies.
  • Troubleshot JabRef command recognition issues and set up executable paths.
  • Explored various tools for gathering citation data, including Google Scholar, Web of Science, and Zotero.
  • Developed Python scripts for web scraping, focusing on HTML parsing using BeautifulSoup and handling pagination.
  • Addressed legal considerations and technical challenges in scraping Google Scholar.
  • Troubleshot Selenium WebDriver issues, including ChromeDriver version mismatches and WebDriver options errors.
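
The pagination handling mentioned above could look roughly like the sketch below. It is not the script written in this session: the base URL, the zero-based `start` offset parameter, and the page size of 10 are assumptions chosen for illustration, and the helper name `page_urls` is hypothetical. The sketch only builds the URLs; fetching and parsing are left out.

```python
from __future__ import annotations

from urllib.parse import urlencode

# Assumptions for illustration only: results are paged with a zero-based
# `start` offset and a fixed page size. Neither value comes from the session.
BASE_URL = "https://scholar.google.com/scholar"
PAGE_SIZE = 10


def page_urls(query: str, pages: int,
              base_url: str = BASE_URL,
              page_size: int = PAGE_SIZE) -> list[str]:
    """Return one search URL per result page for `query` (hypothetical helper)."""
    urls = []
    for page in range(pages):
        # Each page starts `page_size` results after the previous one.
        params = {"q": query, "start": page * page_size}
        urls.append(f"{base_url}?{urlencode(params)}")
    return urls


if __name__ == "__main__":
    for url in page_urls("graph neural networks", pages=3):
        print(url)
```

Keeping URL construction separate from fetching makes the offset logic easy to check without sending any requests.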

Achievements

  • Successfully installed and configured JabRef on a Linux system.
  • Developed and refined Python scripts for extracting and structuring data from HTML content.
  • Addressed common web scraping challenges, including dynamic content loading and version mismatches in automation tools.
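
A version mismatch of the kind noted above can be detected before launching a browser by comparing major version numbers. The sketch below is a minimal illustration, assuming the version text looks like `Google Chrome 119.0.0.0` and `ChromeDriver 118.0.0.0`; the sample strings and the helper names `major_version` and `versions_match` are hypothetical and not taken from the session.

```python
import re


def major_version(version_output: str) -> int:
    """Extract the major version from text such as 'ChromeDriver 118.0.0.0'."""
    # The first dotted number in the text is taken to be the version.
    match = re.search(r"(\d+)\.\d+", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    return int(match.group(1))


def versions_match(browser_output: str, driver_output: str) -> bool:
    """Return True when browser and driver share the same major version."""
    return major_version(browser_output) == major_version(driver_output)


if __name__ == "__main__":
    # Hypothetical sample strings; real values would come from the
    # browser's and driver's own version output.
    browser = "Google Chrome 119.0.0.0"
    driver = "ChromeDriver 118.0.0.0"
    if not versions_match(browser, driver):
        print("Major versions differ; update the driver or the browser.")
```

Only the major version is compared here, on the assumption that minor and patch differences are tolerated.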

Pending Tasks

  • Further refine regex patterns and parsing logic for more accurate data extraction.
  • Explore alternative methods for accessing Google Scholar data legally and efficiently.
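
As a possible starting point for the regex refinement listed above, the sketch below splits a citation byline into authors, venue, and year. The byline format `authors - venue, year - source` is an assumption about the extracted text, and the function name `parse_byline` and the sample input are hypothetical.

```python
import re

# Four-digit years from 1900 to 2099, matched as whole words.
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def parse_byline(byline: str) -> dict:
    """Split 'authors - venue, year - source' into structured fields."""
    parts = [part.strip() for part in byline.split(" - ")]
    authors = [name.strip() for name in parts[0].split(",") if name.strip()]

    venue, year = None, None
    if len(parts) > 1:
        match = YEAR.search(parts[1])
        if match:
            year = int(match.group())
            # Everything before the year is treated as the venue.
            venue = parts[1][:match.start()].rstrip(", ") or None
        else:
            venue = parts[1] or None
    return {"authors": authors, "venue": venue, "year": year}


if __name__ == "__main__":
    print(parse_byline("A Author, B Author - Journal of Examples, 2020 - example.org"))
```

Returning `None` for missing fields, rather than raising, keeps one malformed entry from stopping a whole batch.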