Optimized data storage using Python and Pandas

Day: 2023-03-29
Time: 21:00 to 21:20
Project: Dev
Workspace: WP 2: Operational
Status: Completed
Priority: MEDIUM
Assignee: Matías Nehuen Iglesias
Tags: Data Storage, Pandas, Categorical Data, JSON, Python

Description

Session Goal

The session aimed to explore and implement efficient data storage techniques using Python and Pandas, focusing on optimizing memory usage and data processing efficiency.

Key Activities

Discussed three approaches to efficient data storage: sparse matrices, pivot tables, and a database for large datasets.
Explored converting ‘variable’ and ‘year’ columns to categorical data types in Pandas to improve memory efficiency.
Provided code snippets for converting DataFrame columns to categorical types and ensuring data type retention post-processing.
Demonstrated saving a Pandas DataFrame to a JSON file, considering dataset size and file type efficiency.
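
The categorical conversion mentioned above can be sketched roughly as follows. This is a minimal illustration, not the session's actual code: the 'variable' and 'year' column names come from the notes, while the 'value' column and the sample values are assumptions made for the example.

```python
import pandas as pd

# Hypothetical long-format data; only 'variable' and 'year' are named in the notes.
df = pd.DataFrame({
    "variable": ["gdp", "population", "gdp", "population"] * 250,
    "year": [2020, 2020, 2021, 2021] * 250,
    "value": range(1000),
})

# Memory footprint with the default dtypes (deep=True counts string contents).
bytes_before = df.memory_usage(deep=True).sum()

# Columns with few distinct values are stored as small integer codes
# plus one lookup table of categories.
df["variable"] = df["variable"].astype("category")
df["year"] = df["year"].astype("category")

bytes_after = df.memory_usage(deep=True).sum()
print(f"before: {bytes_before} bytes, after: {bytes_after} bytes")
```

The saving depends on how often values repeat: a column in which nearly every value is unique gains little from a categorical dtype and may even grow.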

Achievements

Implemented categorical conversion for the ‘variable’ and ‘year’ DataFrame columns to reduce memory usage.
Developed a solution for retaining category data types after DataFrame processing.
Saved the DataFrame as JSON, organized by variable and year.
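
One possible way to keep category dtypes through a processing step and then write JSON grouped by variable and year is sketched below. The helper name `reapply_categories`, the 'value' column, the sample data, and the output filename are all hypothetical; the notes do not record the approach actually used. A JSON round trip stands in for the processing step because it is one operation known to drop the category dtype.

```python
import io
import json
import os
import tempfile

import pandas as pd

# Hypothetical sample data; only 'variable' and 'year' are named in the notes.
df = pd.DataFrame({
    "variable": ["gdp", "gdp", "population", "population"],
    "year": [2020, 2021, 2020, 2021],
    "value": [1.0, 1.5, 7.0, 7.5],
}).astype({"variable": "category", "year": "category"})


def reapply_categories(processed, reference):
    """Cast columns back to the categorical dtypes they have in `reference`."""
    dtypes = {
        col: reference[col].dtype
        for col in processed.columns
        if col in reference.columns and reference[col].dtype.name == "category"
    }
    return processed.astype(dtypes)


# Serializing to JSON and reading it back loses the category dtype.
round_tripped = pd.read_json(io.StringIO(df.to_json(orient="records")))
restored = reapply_categories(round_tripped, df)

# Nest the values as {variable: {year: [values]}} before writing to disk.
nested = {}
for (variable, year), group in restored.groupby(["variable", "year"], observed=True):
    nested.setdefault(str(variable), {})[str(year)] = group["value"].tolist()

# Hypothetical output location in the system temp directory.
out_path = os.path.join(tempfile.gettempdir(), "values_by_variable_year.json")
with open(out_path, "w", encoding="utf-8") as fh:
    json.dump(nested, fh)
```

JSON keys are always strings, so the years come back as "2020" and "2021" when the file is loaded and need converting if numeric years are required.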

Pending Tasks

Explore Apache Parquet as a storage format for large datasets to improve performance.

Evidence

source_file=2023-03-29.sessions.jsonl, line_number=3, event_count=0, session_id=0d6b86b9c0034874a019c168944231fcc814d08c20a3a342db1a0909f7520097
event_ids: []
