📅 2023-03-29 - Session: Optimized data storage using Python and Pandas

🕒 21:00–21:20
🏷️ Labels: Data Storage, Pandas, Categorical Data, JSON, Python
📂 Project: Dev
⭐ Priority: MEDIUM

Session Goal

The session aimed to explore and implement efficient data storage techniques in Python and Pandas, with a focus on reducing memory usage and speeding up data processing.

Key Activities

  • Discussed three methods for efficient data storage: sparse matrix, pivot table, and database usage for large datasets.
  • Explored converting the 'variable' and 'year' columns to categorical data types in Pandas to improve memory efficiency.
  • Provided code snippets for converting DataFrame columns to categorical types and ensuring data type retention post-processing.
  • Demonstrated saving a Pandas DataFrame to a JSON file, considering dataset size and file type efficiency.
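
The categorical conversion and dtype retention discussed above might look roughly like the sketch below. The column names 'variable' and 'year' come from these notes. The sample values, the 'value' column, and the concat-based processing step are assumptions made for illustration, not the code from the session.

```python
import pandas as pd

# Hypothetical long-format data: the column names 'variable' and 'year' come
# from the notes, the values are invented for illustration.
df = pd.DataFrame({
    "variable": ["gdp", "population"] * 2000,
    "year": [2020, 2020, 2021, 2021] * 1000,
    "value": [float(i) for i in range(4000)],
})

bytes_before = int(df.memory_usage(deep=True).sum())

# Store the two repetitive columns as categoricals (small integer codes
# plus one lookup table of unique values).
cat_cols = ["variable", "year"]
df = df.astype({col: "category" for col in cat_cols})
bytes_after = int(df.memory_usage(deep=True).sum())

# Remember the exact categorical dtypes so they can be reapplied later.
saved_dtypes = {col: df[col].dtype for col in cat_cols}

# Assumed processing step that can drop the dtype: concatenating pieces whose
# category sets differ makes pandas fall back to a non-categorical column.
gdp = df[df["variable"] == "gdp"].copy()
pop = df[df["variable"] == "population"].copy()
gdp["variable"] = gdp["variable"].cat.remove_unused_categories()
pop["variable"] = pop["variable"].cat.remove_unused_categories()
processed = pd.concat([gdp, pop], ignore_index=True)

# Reapply the saved dtypes after processing.
processed = processed.astype(saved_dtypes)

print(f"memory: {bytes_before} -> {bytes_after} bytes")
print(processed.dtypes)
```

Saving the dtypes in a dictionary and passing it back to `astype` is only one way to retain them. The savings grow with the number of rows per unique value, so columns with few repeats gain little.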

Achievements

  • Successfully implemented categorical conversion for specific DataFrame columns, enhancing data processing efficiency.
  • Developed a solution for retaining category data types after DataFrame processing.
  • Achieved efficient data storage by saving the DataFrame as JSON, organized by variable and year.
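
One possible reading of "organized by variable and year" is a nested mapping of the form `{variable: {year: value}}`. The sketch below assumes that layout. The sample data, the 'value' column, and the output file name are hypothetical and not taken from the session.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical data; only the 'variable' and 'year' column names are from the notes.
df = pd.DataFrame({
    "variable": ["gdp", "gdp", "population", "population"],
    "year": [2020, 2021, 2020, 2021],
    "value": [1.5, 1.7, 100.0, 101.0],
})

# Build {variable: {year: value}}. JSON object keys must be strings,
# so both levels are converted with str().
nested = {}
for variable, year, value in zip(df["variable"], df["year"], df["value"]):
    nested.setdefault(str(variable), {})[str(year)] = float(value)

# Hypothetical output location in the system temp directory.
out_path = Path(tempfile.gettempdir()) / "data_by_variable_year.json"
out_path.write_text(json.dumps(nested, indent=2))

# Read the file back to confirm the round trip.
restored = json.loads(out_path.read_text())
print(restored["gdp"]["2021"])
```

JSON does not record Pandas dtypes, so categorical columns would need to be converted again after the data is loaded back into a DataFrame.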

Pending Tasks

  • Further exploration of Apache Parquet for large dataset storage to enhance performance.