import pandas as pd
import numpy as np
import os
from pathlib import Path
pd.options.display.max_columns = 300
pd.options.display.max_info_columns = 300

Optimizing Memory Usage with Insurance Cross-Sell Prediction
Introduction
The goal of this notebook is to demonstrate techniques for efficiently handling large datasets, focusing on reducing memory usage without losing information. We use a health insurance cross-sell prediction dataset: a synthetic dataset created for the Kaggle Playground Competition of July 2024 (Playground Series, Season 4, Episode 7).
The main objectives are:
- Changing data types in a pandas DataFrame to reduce memory usage.
- Converting CSV files to the more efficient Parquet format.
Setup and Data Loading
With the necessary libraries imported and display options set above, we define paths for our datasets, accommodating both Kaggle and local environments:
if 'KAGGLE_KERNEL_RUN_TYPE' in os.environ:
    print('Running on Kaggle')
    # Path for the synthetic competition dataset
    path = Path('/kaggle/input/playground-series-s4e7/')
    # Path for the original dataset, added as a second input
    path_o = Path('/kaggle/input/health-insurance-cross-sell-prediction-data/')
    train_o_path = path_o/'train.csv'
    test_o_path = path_o/'test.csv'
else:
    print('Running locally')
    path = Path('.')
    path_o = Path('.')
    train_o_path = path_o/'train_original.csv'
    test_o_path = path_o/'test_original.csv'
# Paths for the synthetic train and test files
train_path = path/'train.csv'
test_path = path/'test.csv'
# Print the paths for verification
print(train_path)
print(test_path)
print(train_o_path)
print(test_o_path)
Running locally
train.csv
test.csv
train_original.csv
test_original.csv
Initial Data Exploration
We load our datasets and examine their characteristics:
# Synthetic dataset
train = pd.read_csv(train_path, index_col=0)
test = pd.read_csv(test_path, index_col=0)
# Original dataset
train_o = pd.read_csv(train_o_path, index_col=0)
test_o = pd.read_csv(test_o_path, index_col=0)
train.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
Index: 11504798 entries, 0 to 11504797
Data columns (total 11 columns):
# Column Dtype
--- ------ -----
0 Gender object
1 Age int64
2 Driving_License int64
3 Region_Code float64
4 Previously_Insured int64
5 Vehicle_Age object
6 Vehicle_Damage object
7 Annual_Premium float64
8 Policy_Sales_Channel float64
9 Vintage int64
10 Response int64
dtypes: float64(3), int64(5), object(3)
memory usage: 2.8 GB
test.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
Index: 7669866 entries, 11504798 to 19174663
Data columns (total 10 columns):
# Column Dtype
--- ------ -----
0 Gender object
1 Age int64
2 Driving_License int64
3 Region_Code float64
4 Previously_Insured int64
5 Vehicle_Age object
6 Vehicle_Damage object
7 Annual_Premium float64
8 Policy_Sales_Channel float64
9 Vintage int64
dtypes: float64(3), int64(4), object(3)
memory usage: 1.8 GB
The training set has 11,504,798 entries with 11 columns (including the Response target); the test set has 7,669,866 entries with 10 columns.
The initial memory usage is substantial: 2.8 GB for the training set and 1.8 GB for the test set.
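Most of that footprint comes from the three object (string) columns, where every value is stored as a full Python string object. To see which columns dominate, you can inspect per-column deep memory usage; a minimal sketch using the train DataFrame loaded above:
# Per-column memory in MB, largest first; deep=True counts the actual
# Python string objects held by the object columns
mem_mb = train.memory_usage(deep=True) / 1024**2
print(mem_mb.sort_values(ascending=False).round(1))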
Memory Optimization Strategy
We implement a function that converts each column to a more memory-efficient data type. To avoid data loss, we follow this list of the value ranges each pandas/NumPy data type can represent (a sketch for printing these limits programmatically follows the list):
- int8: Ranges from -128 to 127.
- int16: Ranges from -32,768 to 32,767.
- int32: Ranges from -2,147,483,648 to 2,147,483,647.
- int64: Ranges from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807.
- float16 (Half-precision): Approximate decimal precision of 3 to 4 decimal digits.
- float32 (Single-precision): Approximate decimal precision of 7 to 9 decimal digits.
- float64 (Double-precision): Approximate decimal precision of 15 to 17 decimal digits.
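There is no need to memorize these limits: NumPy exposes them programmatically via np.iinfo and np.finfo. A small sketch that prints them (np.finfo reports slightly more conservative precision figures than the ranges quoted above):
# Exact representable range for each integer type
for t in ('int8', 'int16', 'int32', 'int64'):
    info = np.iinfo(t)
    print(f'{t}: {info.min} to {info.max}')
# Approximate decimal precision for each float type
for t in ('float16', 'float32', 'float64'):
    print(f'{t}: ~{np.finfo(t).precision} decimal digits')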
train.describe().T

|  | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Age | 11504798.0 | 38.383563 | 14.993459 | 20.0 | 24.0 | 36.0 | 49.0 | 85.0 |
| Driving_License | 11504798.0 | 0.998022 | 0.044431 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Region_Code | 11504798.0 | 26.418690 | 12.991590 | 0.0 | 15.0 | 28.0 | 35.0 | 52.0 |
| Previously_Insured | 11504798.0 | 0.462997 | 0.498629 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| Annual_Premium | 11504798.0 | 30461.370411 | 16454.745205 | 2630.0 | 25277.0 | 31824.0 | 39451.0 | 540165.0 |
| Policy_Sales_Channel | 11504798.0 | 112.425442 | 54.035708 | 1.0 | 29.0 | 151.0 | 152.0 | 163.0 |
| Vintage | 11504798.0 | 163.897744 | 79.979531 | 10.0 | 99.0 | 166.0 | 232.0 | 299.0 |
| Response | 11504798.0 | 0.122997 | 0.328434 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
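This summary justifies the target types used below: Age tops out at 85 (fits int8), Vintage at 299 and Policy_Sales_Channel at 163 (both above the int8 limit of 127, so int16), and Annual_Premium at 540,165 (needs int32). Before casting the float64 columns to integers, it is also worth confirming they hold only whole numbers and no missing values; a minimal sketch of such pre-cast checks:
# Confirm each column fits its planned integer type before casting
assert train['Age'].max() <= np.iinfo('int8').max
assert train['Policy_Sales_Channel'].max() <= np.iinfo('int16').max
assert train['Annual_Premium'].max() <= np.iinfo('int32').max
# Float columns can only be cast to int safely if they are whole-valued
# and contain no NaNs
for col in ('Region_Code', 'Annual_Premium', 'Policy_Sales_Channel'):
    assert train[col].notna().all() and (train[col] % 1 == 0).all()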
def converting_datatypes(df):
    df = df.copy()
    try:
        # Low-cardinality string columns become categories
        df['Gender'] = df['Gender'].astype('category')
        df['Vehicle_Age'] = df['Vehicle_Age'].astype('category')
        df['Vehicle_Damage'] = df['Vehicle_Damage'].astype('category')
        # Integer types sized according to the ranges observed above
        df['Age'] = df['Age'].astype('int8')
        df['Driving_License'] = df['Driving_License'].astype('int8')
        df['Region_Code'] = df['Region_Code'].astype('int8')
        df['Previously_Insured'] = df['Previously_Insured'].astype('int8')
        df['Annual_Premium'] = df['Annual_Premium'].astype('int32')
        df['Policy_Sales_Channel'] = df['Policy_Sales_Channel'].astype('int16')
        df['Vintage'] = df['Vintage'].astype('int16')
        # Test sets have no Response column, so this raises a KeyError there
        df['Response'] = df['Response'].astype('int8')
        print(df.info(memory_usage='deep'))
    except KeyError as e:
        print(f"Error: {e} not found in DataFrame")
    except Exception as e:
        print(f"An error occurred: {e}")
    return df

We apply this function to our datasets:
train = converting_datatypes(train)
test = converting_datatypes(test)
train_o = converting_datatypes(train_o)
test_o = converting_datatypes(test_o)
<class 'pandas.core.frame.DataFrame'>
Index: 11504798 entries, 0 to 11504797
Data columns (total 11 columns):
# Column Dtype
--- ------ -----
0 Gender category
1 Age int8
2 Driving_License int8
3 Region_Code int8
4 Previously_Insured int8
5 Vehicle_Age category
6 Vehicle_Damage category
7 Annual_Premium int32
8 Policy_Sales_Channel int16
9 Vintage int16
10 Response int8
dtypes: category(3), int16(2), int32(1), int8(5)
memory usage: 263.3 MB
None
Error: 'Response' not found in DataFrame
<class 'pandas.core.frame.DataFrame'>
Index: 381109 entries, 1 to 381109
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Gender 381109 non-null category
1 Age 381109 non-null int8
2 Driving_License 381109 non-null int8
3 Region_Code 381109 non-null int8
4 Previously_Insured 381109 non-null int8
5 Vehicle_Age 381109 non-null category
6 Vehicle_Damage 381109 non-null category
7 Annual_Premium 381109 non-null int32
8 Policy_Sales_Channel 381109 non-null int16
9 Vintage 381109 non-null int16
10 Response 381109 non-null int8
dtypes: category(3), int16(2), int32(1), int8(5)
memory usage: 8.7 MB
None
Error: 'Response' not found in DataFrame
Results of Memory Optimization
After applying our optimization strategy:
- The training set's memory usage dropped from 2.8 GB to 263.3 MB.
- The test set's memory usage dropped from 1.8 GB to 168.2 MB.
This is roughly a 90% reduction, making the datasets far more manageable for analysis and modeling despite the large number of rows we are working with.
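As an alternative to hand-picking types, pandas can size the numeric columns automatically: pd.to_numeric with the downcast argument picks the smallest safe integer type per column, converting whole-valued, NaN-free float columns along the way. A hedged sketch of that approach on a freshly loaded copy (it does not cover the string-to-category step, so the explicit function above still saves more):
# Let pandas choose the smallest safe integer type per numeric column
raw = pd.read_csv(train_path, index_col=0)
for col in raw.select_dtypes('number').columns:
    raw[col] = pd.to_numeric(raw[col], downcast='integer')
print(raw.dtypes)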
Saving Optimized Data
To preserve our memory-efficient datasets and enable faster loading in the future, we save them in Parquet format:
# Save files as parquet for faster loading
train.to_parquet('train.parquet')
test.to_parquet('test.parquet')
train_o.to_parquet('train_original.parquet')
test_o.to_parquet('test_original.parquet')
Conclusion
We significantly reduced memory usage by optimizing data types, making large datasets more manageable. We also converted CSV files to Parquet format, which preserves data types and allows for faster data loading in future analyses.
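As a quick check that the round trip really preserves the optimized types, reload one file and compare its dtypes with the in-memory DataFrame; a minimal sketch:
# True when the category/int8/int16/int32 dtypes survived the round trip
train_back = pd.read_parquet('train.parquet')
print(train_back.dtypes.equals(train.dtypes))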
These techniques are crucial when working with big data, as they allow for more efficient data processing and analysis, especially in environments with limited computational resources.