import pandas as pd
import numpy as np
import os
from pathlib import Path
pd.options.display.max_columns = 300
pd.options.display.max_info_columns = 300

Optimizing Memory Usage with Insurance Cross-Sell Prediction
Introduction
The goal of this notebook is to demonstrate techniques for efficiently handling large datasets, focusing on reducing memory usage without losing information. We use a health insurance cross-sell prediction dataset: a synthetic dataset created for the Kaggle Playground Competition of July 2024 (Playground Series, Season 4, Episode 7).
The main objectives are:
- Changing data types in a pandas DataFrame to reduce memory usage.
- Converting CSV files to the more efficient Parquet format.
Setup and Data Loading
With the necessary libraries imported and display options set above, we define paths for our datasets, accommodating both Kaggle and local environments:
if 'KAGGLE_KERNEL_RUN_TYPE' in os.environ:
    print('Running on Kaggle')
    # Path for the synthetic competition dataset
    path = Path('/kaggle/input/playground-series-s4e7/')
    # Path for the original dataset, added as a second input
    path_o = Path('/kaggle/input/health-insurance-cross-sell-prediction-data/')
    train_o_path = path_o/'train.csv'
    test_o_path = path_o/'test.csv'
else:
    print('Running locally')
    path = Path('.')
    path_o = Path('.')
    train_o_path = path_o/'train_original.csv'
    test_o_path = path_o/'test_original.csv'
# Paths for the synthetic train and test files
train_path = path/'train.csv'
test_path = path/'test.csv'
# Print the paths for verification
print(train_path)
print(test_path)
print(train_o_path)
print(test_o_path)
Running locally
train.csv
test.csv
train_original.csv
test_original.csv
Initial Data Exploration
We load our datasets and examine their characteristics:
# Synthetic dataset
train = pd.read_csv(train_path, index_col=0)
test = pd.read_csv(test_path, index_col=0)
# Original dataset
train_o = pd.read_csv(train_o_path, index_col=0)
test_o = pd.read_csv(test_o_path, index_col=0)
train.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
Index: 11504798 entries, 0 to 11504797
Data columns (total 11 columns):
# Column Dtype
--- ------ -----
0 Gender object
1 Age int64
2 Driving_License int64
3 Region_Code float64
4 Previously_Insured int64
5 Vehicle_Age object
6 Vehicle_Damage object
7 Annual_Premium float64
8 Policy_Sales_Channel float64
9 Vintage int64
10 Response int64
dtypes: float64(3), int64(5), object(3)
memory usage: 2.8 GB
test.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
Index: 7669866 entries, 11504798 to 19174663
Data columns (total 10 columns):
# Column Dtype
--- ------ -----
0 Gender object
1 Age int64
2 Driving_License int64
3 Region_Code float64
4 Previously_Insured int64
5 Vehicle_Age object
6 Vehicle_Damage object
7 Annual_Premium float64
8 Policy_Sales_Channel float64
9 Vintage int64
dtypes: float64(3), int64(4), object(3)
memory usage: 1.8 GB
The training set has 11,504,798 entries with 11 columns (including the Response target); the test set has 7,669,866 entries with 10 columns.
The initial memory usage is substantial: 2.8 GB for the training set and 1.8 GB for the test set.
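Most of that footprint comes from the three object (string) columns, where every value is stored as a full Python string object. To see which columns dominate, you can inspect per-column deep memory usage; a minimal sketch using the train DataFrame loaded above:
# Per-column memory in MB, largest first; deep=True counts the actual
# Python string objects held by the object columns
mem_mb = train.memory_usage(deep=True) / 1024**2
print(mem_mb.sort_values(ascending=False).round(1))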
Memory Optimization Strategy
We implement a function that converts each column to a more memory-efficient data type. To avoid data loss, we follow this list of the value ranges each pandas/NumPy data type can represent (a sketch for printing these limits programmatically follows the list):
- int8: Ranges from -128 to 127.
- int16: Ranges from -32,768 to 32,767.
- int32: Ranges from -2,147,483,648 to 2,147,483,647.
- int64: Ranges from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807.
- float16 (Half-precision): Approximate decimal precision of 3 to 4 decimal digits.
- float32 (Single-precision): Approximate decimal precision of 7 to 9 decimal digits.
- float64 (Double-precision): Approximate decimal precision of 15 to 17 decimal digits.
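There is no need to memorize these limits: NumPy exposes them programmatically via np.iinfo and np.finfo. A small sketch that prints them (np.finfo reports slightly more conservative precision figures than the ranges quoted above):
# Exact representable range for each integer type
for t in ('int8', 'int16', 'int32', 'int64'):
    info = np.iinfo(t)
    print(f'{t}: {info.min} to {info.max}')
# Approximate decimal precision for each float type
for t in ('float16', 'float32', 'float64'):
    print(f'{t}: ~{np.finfo(t).precision} decimal digits')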
train.describe().T

|  | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Age | 11504798.0 | 38.383563 | 14.993459 | 20.0 | 24.0 | 36.0 | 49.0 | 85.0 |
| Driving_License | 11504798.0 | 0.998022 | 0.044431 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Region_Code | 11504798.0 | 26.418690 | 12.991590 | 0.0 | 15.0 | 28.0 | 35.0 | 52.0 |
| Previously_Insured | 11504798.0 | 0.462997 | 0.498629 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| Annual_Premium | 11504798.0 | 30461.370411 | 16454.745205 | 2630.0 | 25277.0 | 31824.0 | 39451.0 | 540165.0 |
| Policy_Sales_Channel | 11504798.0 | 112.425442 | 54.035708 | 1.0 | 29.0 | 151.0 | 152.0 | 163.0 |
| Vintage | 11504798.0 | 163.897744 | 79.979531 | 10.0 | 99.0 | 166.0 | 232.0 | 299.0 |
| Response | 11504798.0 | 0.122997 | 0.328434 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
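This summary justifies the target types used below: Age tops out at 85 (fits int8), Vintage at 299 and Policy_Sales_Channel at 163 (both above the int8 limit of 127, so int16), and Annual_Premium at 540,165 (needs int32). Before casting the float64 columns to integers, it is also worth confirming they hold only whole numbers and no missing values; a minimal sketch of such pre-cast checks:
# Confirm each column fits its planned integer type before casting
assert train['Age'].max() <= np.iinfo('int8').max
assert train['Policy_Sales_Channel'].max() <= np.iinfo('int16').max
assert train['Annual_Premium'].max() <= np.iinfo('int32').max
# Float columns can only be cast to int safely if they are whole-valued
# and contain no NaNs
for col in ('Region_Code', 'Annual_Premium', 'Policy_Sales_Channel'):
    assert train[col].notna().all() and (train[col] % 1 == 0).all()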
def converting_datatypes(df):
    df = df.copy()
    try:
        # Low-cardinality string columns become categories
        df['Gender'] = df['Gender'].astype('category')
        df['Vehicle_Age'] = df['Vehicle_Age'].astype('category')
        df['Vehicle_Damage'] = df['Vehicle_Damage'].astype('category')
        # Integer types sized according to the ranges observed above
        df['Age'] = df['Age'].astype('int8')
        df['Driving_License'] = df['Driving_License'].astype('int8')
        df['Region_Code'] = df['Region_Code'].astype('int8')
        df['Previously_Insured'] = df['Previously_Insured'].astype('int8')
        df['Annual_Premium'] = df['Annual_Premium'].astype('int32')
        df['Policy_Sales_Channel'] = df['Policy_Sales_Channel'].astype('int16')
        df['Vintage'] = df['Vintage'].astype('int16')
        # Test sets have no Response column, so this raises a KeyError there
        df['Response'] = df['Response'].astype('int8')
        print(df.info(memory_usage='deep'))
    except KeyError as e:
        print(f"Error: {e} not found in DataFrame")
    except Exception as e:
        print(f"An error occurred: {e}")
    return df

We apply this function to our datasets:
train = converting_datatypes(train)
test = converting_datatypes(test)
train_o = converting_datatypes(train_o)
test_o = converting_datatypes(test_o)
<class 'pandas.core.frame.DataFrame'>
Index: 11504798 entries, 0 to 11504797
Data columns (total 11 columns):
# Column Dtype
--- ------ -----
0 Gender category
1 Age int8
2 Driving_License int8
3 Region_Code int8
4 Previously_Insured int8
5 Vehicle_Age category
6 Vehicle_Damage category
7 Annual_Premium int32
8 Policy_Sales_Channel int16
9 Vintage int16
10 Response int8
dtypes: category(3), int16(2), int32(1), int8(5)
memory usage: 263.3 MB
None
Error: 'Response' not found in DataFrame
<class 'pandas.core.frame.DataFrame'>
Index: 381109 entries, 1 to 381109
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Gender 381109 non-null category
1 Age 381109 non-null int8
2 Driving_License 381109 non-null int8
3 Region_Code 381109 non-null int8
4 Previously_Insured 381109 non-null int8
5 Vehicle_Age 381109 non-null category
6 Vehicle_Damage 381109 non-null category
7 Annual_Premium 381109 non-null int32
8 Policy_Sales_Channel 381109 non-null int16
9 Vintage 381109 non-null int16
10 Response 381109 non-null int8
dtypes: category(3), int16(2), int32(1), int8(5)
memory usage: 8.7 MB
None
Error: 'Response' not found in DataFrame
Results of Memory Optimization
After applying our optimization strategy:
- The training set's memory usage dropped from 2.8 GB to 263.3 MB.
- The test set's memory usage dropped from 1.8 GB to 168.2 MB.
This is roughly a 90% reduction, making the datasets far more manageable for analysis and modeling despite the large number of rows we are working with.
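As an alternative to hand-picking types, pandas can size the numeric columns automatically: pd.to_numeric with the downcast argument picks the smallest safe integer type per column, converting whole-valued, NaN-free float columns along the way. A hedged sketch of that approach on a freshly loaded copy (it does not cover the string-to-category step, so the explicit function above still saves more):
# Let pandas choose the smallest safe integer type per numeric column
raw = pd.read_csv(train_path, index_col=0)
for col in raw.select_dtypes('number').columns:
    raw[col] = pd.to_numeric(raw[col], downcast='integer')
print(raw.dtypes)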
Saving Optimized Data
To preserve our memory-efficient datasets and enable faster loading in the future, we save them in Parquet format:
# Save files as parquet for faster loading
train.to_parquet('train.parquet')
test.to_parquet('test.parquet')
train_o.to_parquet('train_original.parquet')
test_o.to_parquet('test_original.parquet')
Conclusion
We significantly reduced memory usage by optimizing data types, making large datasets more manageable. We also converted CSV files to Parquet format, which preserves data types and allows for faster data loading in future analyses.
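As a quick check that the round trip really preserves the optimized types, reload one file and compare its dtypes with the in-memory DataFrame; a minimal sketch:
# True when the category/int8/int16/int32 dtypes survived the round trip
train_back = pd.read_parquet('train.parquet')
print(train_back.dtypes.equals(train.dtypes))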
These techniques are crucial when working with big data, as they allow for more efficient data processing and analysis, especially in environments with limited computational resources.