Optimized data storage using Python and Pandas

  • Day: 2023-03-29
  • Time: 21:00 to 21:20
  • Project: Dev
  • Workspace: WP 2: Operational
  • Status: Completed
  • Priority: MEDIUM
  • Assignee: Matías Nehuen Iglesias
  • Tags: Data Storage, Pandas, Categorical Data, JSON, Python

Description

Session Goal

The session explored and implemented memory-efficient data storage techniques in Python and Pandas, with a focus on reducing memory usage and speeding up data processing.

Key Activities

  • Discussed three methods for efficient data storage: sparse matrix, pivot table, and database usage for large datasets.
  • Explored converting ‘variable’ and ‘year’ columns to categorical data types in Pandas to improve memory efficiency.
  • Provided code snippets for converting DataFrame columns to categorical types and for keeping those types after processing.
  • Demonstrated saving a Pandas DataFrame to a JSON file, considering dataset size and file type efficiency.
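
The categorical conversion mentioned above can be sketched as follows. This is a minimal illustration, not the code from the session: only the column names ‘variable’ and ‘year’ come from the notes, while the sample values, the ‘value’ column, and the row count are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical long-format data; only the 'variable' and 'year'
# column names are taken from the session notes.
df = pd.DataFrame({
    "variable": ["gdp", "population", "gdp", "population"] * 1000,
    "year": [2020, 2020, 2021, 2021] * 1000,
    "value": range(4000),
})

# Total memory footprint in bytes before conversion (deep=True also
# counts the string contents, not just the references to them).
before = df.memory_usage(deep=True).sum()

# Convert the two low-cardinality columns to the categorical dtype.
converted = df.astype({"variable": "category", "year": "category"})

# Total memory footprint in bytes after conversion.
after = converted.memory_usage(deep=True).sum()

print(f"before: {before} bytes, after: {after} bytes")
```

The saving comes from storing each distinct value once and keeping only small integer codes per row, so it is largest when a column has few distinct values relative to its length.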

Achievements

  • Converted the ‘variable’ and ‘year’ DataFrame columns to categorical types to reduce memory usage and speed up processing.
  • Developed a way to retain categorical data types after DataFrame processing.
  • Saved the DataFrame as a JSON file organized by variable and year.
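
One possible shape for the last two items is sketched below. The notes do not record the actual JSON layout or the retention technique, so both are assumptions here: the file is nested as {variable: {year: value}}, and the categorical types are restored by reapplying astype after loading, since plain JSON does not carry dtype information. The sample data, the ‘value’ column, and the file name are hypothetical.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical sample data with the two categorical columns.
categorical_columns = {"variable": "category", "year": "category"}
df = pd.DataFrame({
    "variable": ["gdp", "gdp", "population", "population"],
    "year": [2020, 2021, 2020, 2021],
    "value": [1.5, 1.7, 100.0, 101.0],
}).astype(categorical_columns)

# Assumed layout: {variable: {year: value}}. JSON keys must be strings.
nested = {}
for row in df.itertuples(index=False):
    nested.setdefault(str(row.variable), {})[str(row.year)] = float(row.value)

# Write to a temporary directory; the file name is illustrative only.
path = os.path.join(tempfile.mkdtemp(), "data_by_variable_year.json")
with open(path, "w") as fh:
    json.dump(nested, fh, indent=2)

# Load the file back and flatten it into one record per (variable, year).
with open(path) as fh:
    loaded = json.load(fh)

records = [
    {"variable": variable, "year": int(year), "value": value}
    for variable, years in loaded.items()
    for year, value in years.items()
]

# Reapply the categorical dtypes, which the JSON round trip does not keep.
restored = pd.DataFrame(records).astype(categorical_columns)
```

Keeping the dtype mapping in a single dictionary, as with the hypothetical categorical_columns above, means the same conversion can be reapplied after any step that drops the categorical types.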

Pending Tasks

  • Explore Apache Parquet as a storage format for large datasets to improve performance.

Evidence

  • source_file=2023-03-29.sessions.jsonl, line_number=3, event_count=0, session_id=0d6b86b9c0034874a019c168944231fcc814d08c20a3a342db1a0909f7520097
  • event_ids: []