Integrated Graph-Based Search with NLP Enhancements
- Day: 2025-02-17
- Time: 17:10 to 17:40
- Project: Dev
- Workspace: WP 2: Operational
- Status: In Progress
- Priority: MEDIUM
- Assignee: Matías Nehuen Iglesias
- Tags: Graph Search, Cassandra, NLP, Text Classification, Knowledge Graph
Description
Session Goal
The session explored how to integrate a graph-based search system with Cassandra storage and how NLP techniques could improve data processing.
Key Activities
- Graph-Based Search and Cassandra Storage: Discussed the structure and functionality of a graph-based search system integrated with Cassandra for efficient storage and retrieval of text chunks, focusing on metadata, embeddings, and document relationships.
- Data Structure Enhancements: Reviewed the current data structure, proposing new analysis tables to improve data processing while maintaining ID compatibility.
- Knowledge Aggregation: Explored methods for aggregating small, fine-grained pieces of knowledge into higher-level units, and discussed strategies for storing and integrating them.
- NLP Annotations to Knowledge Web: Outlined steps to transform NLP annotations into a knowledge web, including preprocessing, classification, and graph construction.
- SOTA Models for Text Classification: Summarized state-of-the-art models from Hugging Face for text classification, providing recommendations based on language and task requirements.
- Script Plan for Text Classification: Developed a high-level plan for a script that uses the all-MiniLM-L6-v2 sentence-embedding model (hosted on Hugging Face) to generate embeddings, cluster 35,000 text chunks, and integrate the resulting clusters into a knowledge graph.
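
The script plan in the last item could be sketched roughly as follows. This is a minimal illustration, not the script from the session. The session notes name only the embedding model, so the rest is assumed: MiniBatchKMeans as the clustering algorithm, the value n_clusters=50, the function names, and the node/edge layout are all hypothetical. The embedding and clustering steps need the sentence-transformers and scikit-learn packages, which are imported lazily so the graph-building step can be used on its own.

```python
from typing import Callable, Sequence


def embed_chunks(texts: Sequence[str]) -> list[list[float]]:
    """Embed texts with all-MiniLM-L6-v2 (needs the sentence-transformers package)."""
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    # Normalized vectors make Euclidean k-means behave like cosine clustering.
    vectors = model.encode(list(texts), batch_size=256, normalize_embeddings=True)
    return vectors.tolist()


def cluster_vectors(vectors: Sequence[Sequence[float]], n_clusters: int = 50) -> list[int]:
    """Assign each vector a cluster label (assumption: MiniBatchKMeans, 50 clusters)."""
    from sklearn.cluster import MiniBatchKMeans

    km = MiniBatchKMeans(n_clusters=n_clusters, random_state=0, n_init=3)
    return km.fit_predict(vectors).tolist()


def build_cluster_graph(chunk_ids: Sequence[str], labels: Sequence[int]):
    """Return (nodes, edges): one node per chunk and per cluster, linked by belongs_to."""
    nodes: dict[str, dict] = {}
    edges: list[tuple[str, str, str]] = []
    for chunk_id, label in zip(chunk_ids, labels):
        cluster_id = f"cluster:{label}"  # hypothetical ID scheme for cluster nodes
        nodes[chunk_id] = {"type": "chunk"}
        nodes.setdefault(cluster_id, {"type": "cluster", "size": 0})["size"] += 1
        edges.append((chunk_id, "belongs_to", cluster_id))
    return nodes, edges


def run_pipeline(
    chunks: dict[str, str],
    embed: Callable = embed_chunks,
    cluster: Callable = cluster_vectors,
):
    """Embed, cluster, and convert chunks (id -> text) into graph nodes and edges."""
    ids = list(chunks)
    vectors = embed([chunks[i] for i in ids])
    labels = cluster(vectors)
    return build_cluster_graph(ids, labels)
```

Chunk IDs are passed through unchanged, so the resulting nodes and edges can be joined back to the existing chunk records by ID. The embed and cluster arguments are injectable, so either step can be swapped out (for example, for a different model or clustering method) without touching the graph construction.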
Achievements
- Clarified the integration process of graph-based search with Cassandra.
- Proposed enhancements to the current data structure for better data management.
- Identified state-of-the-art NLP models suitable for various text classification tasks.
Pending Tasks
- Implement the proposed data structure enhancements.
- Execute the script plan for text chunk classification and integration into the knowledge graph.
Evidence
- source_file=2025-02-17.sessions.jsonl, line_number=7, event_count=0, session_id=2eebfecf6ccccc36e26a3f619f3ad73a13e1ca3cf406d4b0a204bd26ab688939
- event_ids: []