Designed and Implemented Knowledge Base for Academic Papers

  • Day: 2025-11-15
  • Time: 16:55 to 17:25
  • Project: Dev
  • Workspace: WP 2: Operational
  • Status: In Progress
  • Priority: MEDIUM
  • Assignee: Matías Nehuen Iglesias
  • Tags: Knowledge Base, Vector Stores, Embeddings, Memory Management, Tokenization

Description

Session Goal

The goal of this session was to design a knowledge base for academic papers that turns a small institution's paper series into a browsable, queryable collection.

Key Activities

  • Developed a detailed execution plan for the knowledge base, focusing on user experience metaphors, data models, and AI workflows.
  • Researched vector stores and embedding practices, comparing options like FAISS, ChromaDB, Qdrant, Weaviate, and Milvus for semantic search capabilities.
  • Implemented scripts to calculate the byte size and memory usage of embedding vectors at different dimensions, comparing uncompressed storage with product quantization (PQ) compression.
  • Created a Python function that splits documents into token chunks and demonstrated it on sample data.
  • Compiled a master reference guide on embeddings and vector storage, tailored for managing a collection of approximately 1,000 papers, covering chunking strategies, index choices, and retrieval design.
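
The memory-sizing activity above can be illustrated with a minimal sketch. It is not the session's actual script: the function names, the 50 chunks per paper, the 768-dimensional float32 vectors, and the PQ settings (96 sub-quantizers, 8 bits per code) are all assumptions chosen for illustration. Only the figure of roughly 1,000 papers comes from this log.

```python
# Hypothetical sizing helpers; names and default values are illustrative assumptions.

def raw_vector_bytes(num_vectors: int, dim: int, bytes_per_component: int = 4) -> int:
    """Bytes needed to store uncompressed vectors (float32 by default)."""
    return num_vectors * dim * bytes_per_component


def pq_code_bytes(num_vectors: int, num_subquantizers: int, bits_per_code: int = 8) -> int:
    """Bytes needed for PQ codes: each vector becomes one code per sub-quantizer."""
    bytes_per_vector = (num_subquantizers * bits_per_code + 7) // 8  # round up to whole bytes
    return num_vectors * bytes_per_vector


def pq_codebook_bytes(dim: int, bits_per_code: int = 8, bytes_per_component: int = 4) -> int:
    """Bytes for the PQ codebooks: 2**bits centroids per sub-quantizer,
    which adds up to 2**bits * dim components across all sub-quantizers."""
    return (2 ** bits_per_code) * dim * bytes_per_component


if __name__ == "__main__":
    papers = 1_000           # approximate collection size from the log
    chunks_per_paper = 50    # assumption
    dim = 768                # assumption
    n = papers * chunks_per_paper

    raw = raw_vector_bytes(n, dim)
    compressed = pq_code_bytes(n, num_subquantizers=96) + pq_codebook_bytes(dim)
    print(f"vectors:          {n:,}")
    print(f"uncompressed:     {raw / 1e6:.1f} MB")
    print(f"PQ (96 x 8 bits): {compressed / 1e6:.1f} MB")
```

Under these assumed numbers the uncompressed vectors fit comfortably in memory, so PQ would be an optional optimization rather than a requirement at this collection size. The estimate ignores index overhead such as graph links or stored metadata.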

Achievements

  • Outlined the data model and AI workflow for the knowledge base.
  • Completed the research and comparison of vector stores, clarifying best practices for document embeddings.
  • Developed practical scripts for memory management and tokenization, aiding in efficient data processing.
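
As an illustration of the tokenization script mentioned above, here is a minimal sketch of overlapping token chunking. It is not the function written in the session: the name `chunk_tokens`, its parameters, and the whitespace tokenization in the usage example are assumptions. A real pipeline would likely use the tokenizer of the chosen embedding model.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def chunk_tokens(tokens: Sequence[T], chunk_size: int = 200, overlap: int = 20) -> List[List[T]]:
    """Split a token sequence into windows of at most chunk_size tokens,
    where consecutive windows share `overlap` tokens. (Hypothetical helper.)"""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks: List[List[T]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        # Stop once a window reaches the end, so no trailing chunk is pure overlap.
        if start + chunk_size >= len(tokens):
            break
    return chunks


if __name__ == "__main__":
    # Whitespace splitting stands in for a real tokenizer (assumption).
    sample = "embeddings map text chunks to vectors for semantic search".split()
    for i, chunk in enumerate(chunk_tokens(sample, chunk_size=4, overlap=1)):
        print(i, chunk)
```

The overlap keeps sentences that straddle a chunk boundary retrievable from either side, at the cost of storing some tokens twice.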

Pending Tasks

  • Further refinement of the minimal viable product roadmap for the knowledge base.
  • Implementation of the designed workflows into a functional prototype.
  • Testing and validation of the knowledge base with real-world data.

Evidence

  • source_file=2025-11-15.sessions.jsonl, line_number=2, event_count=0, session_id=9f9a589664a13cc17f7ff1516f5645db708d17e092c3c5a4dc60665f5c2b61a9
  • event_ids: []