📅 2025-11-15 — Session: Designed and Implemented Knowledge Base for Academic Papers

🕒 16:55–17:25
🏷️ Labels: Knowledge Base, Vector Stores, Embeddings, Memory Management, Tokenization
📂 Project: Dev

Session Goal

The primary goal of this session was to design a comprehensive knowledge base for academic papers, turning a small-institution paper series into a browsable, queryable collection.

Key Activities

  • Developed a detailed execution plan for the knowledge base, focusing on user experience metaphors, data models, and AI workflows.
  • Researched vector stores and embedding practices, comparing FAISS, ChromaDB, Qdrant, Weaviate, and Milvus for semantic search (a minimal search sketch follows this list).
  • Implemented scripts to estimate byte sizes and memory usage across vector dimensions, covering both uncompressed (float32) storage and product quantization (PQ); see the sizing sketch below.
  • Created a Python function for token chunking in document processing and demonstrated it on sample data (a reconstruction appears after this list).
  • Compiled a master reference guide on embeddings and vector storage, tailored for managing a collection of approximately 1,000 papers, covering chunking strategies, index choices, and retrieval design.
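
As a reference point for the vector-store comparison above, here is a minimal exact-search sketch using FAISS. An inner-product index over L2-normalized vectors is equivalent to cosine similarity; the dimension, corpus size, and random data are illustrative assumptions, not values recorded in the session.

```python
import faiss
import numpy as np

DIM = 1024  # assumed embedding dimension; substitute your model's output size

# Illustrative corpus: random unit vectors stand in for real chunk embeddings.
rng = np.random.default_rng(42)
corpus = rng.standard_normal((5_000, DIM)).astype("float32")
faiss.normalize_L2(corpus)  # unit vectors: inner product == cosine similarity

index = faiss.IndexFlatIP(DIM)  # exact inner-product search, no training step
index.add(corpus)

query = rng.standard_normal((1, DIM)).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)  # top-5 most similar chunks
print(ids[0], scores[0])
```

At the scale discussed here (~1,000 papers, on the order of 10^5 chunks), an exact flat index is typically fast enough; IVF or HNSW variants mainly pay off at much larger corpus sizes.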
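
The session's sizing scripts are not reproduced verbatim; the sketch below shows the arithmetic they rest on. A float32 vector costs 4 bytes per dimension, while product quantization stores one code per subquantizer per vector plus shared codebooks. The corpus figures (1,000 papers at roughly 100 chunks each, 1024 dimensions, m = 64) are illustrative assumptions.

```python
def flat_bytes(n_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw storage for uncompressed vectors (float32 by default)."""
    return n_vectors * dim * bytes_per_value


def pq_bytes(n_vectors: int, dim: int, m: int = 64, bits: int = 8) -> int:
    """Approximate storage under product quantization: each vector becomes
    m codes of `bits` bits, plus the shared codebooks
    (m subquantizers x 2**bits centroids x dim/m float32 values)."""
    codes = n_vectors * m * bits // 8
    codebooks = m * (2 ** bits) * (dim // m) * 4
    return codes + codebooks


# Illustrative corpus: ~1,000 papers at ~100 chunks each, 1024-dim embeddings.
n, d = 1_000 * 100, 1024
print(f"flat float32: {flat_bytes(n, d) / 2**20:7.1f} MiB")  # ~390 MiB
print(f"PQ (m=64):    {pq_bytes(n, d) / 2**20:7.1f} MiB")    # ~7 MiB
```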
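
The chunking function written during the session is likewise not reproduced here; the version below is a plausible reconstruction. It assumes the tiktoken tokenizer (the session notes don't name a library) and uses a sliding window with overlap, so text cut at a chunk boundary still appears whole in one chunk.

```python
import tiktoken


def chunk_by_tokens(text: str, max_tokens: int = 512, overlap: int = 64) -> list[str]:
    """Split text into chunks of at most max_tokens tokens, repeating
    `overlap` tokens between consecutive chunks."""
    assert max_tokens > overlap, "window must advance on every step"
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start : start + max_tokens]
        chunks.append(enc.decode(window))
        if start + max_tokens >= len(tokens):
            break  # final window already covers the tail of the text
    return chunks


# Sample usage with throwaway data.
sample = "Semantic search splits documents into overlapping chunks. " * 200
print(len(chunk_by_tokens(sample, max_tokens=128, overlap=16)))
```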

Achievements

  • Successfully outlined the data model and AI workflow for the knowledge base.
  • Completed the vector-store research and comparison, clarifying best practices for document embeddings.
  • Developed working scripts for memory estimation and token chunking, supporting efficient document processing.

Pending Tasks

  • Further refinement of the minimum viable product (MVP) roadmap for the knowledge base.
  • Implementation of the designed workflows into a functional prototype.
  • Testing and validation of the knowledge base with real-world data.