Working with Chunks in Pandas, PyArrow, and Parquet - Amex Competition Example

pandas
PyArrow
Parquet
AMEX
data transformation
data types
Author

JM Ascacibar

Published

May 17, 2024


# libraries
import pandas as pd
import numpy as np 
import psutil
import gc

import pyarrow.parquet as pq
import pyarrow as pa

pd.options.display.max_columns = 200
pd.options.display.max_rows = 200
pd.options.display.max_info_columns = 300
def available_memory_gb():
    return psutil.virtual_memory().available / (1024**3)

Introduction - The AMEX Default Prediction Competition

The AMEX default prediction competition on Kaggle is a binary classification problem: the goal is to predict whether a credit card holder will default on their payment within 120 days after their latest credit card statement, with the target defined over an 18-month performance window following that statement.

The first step in any data science project is to load the data and understand its structure. The problem here is that the dataset is 50GB in size, so it is not possible to load it all into memory at once (at least not on my machine). This is where the headaches start.

In this notebook, I will show how to load the data in chunks using Pandas and PyArrow, and how to save it in Parquet format. This way, we can load the data in chunks and avoid memory issues.

You don’t need to load the data all at once

The first thing to understand is that you don’t need to load the data all at once. You can get a first glance at the data by loading only a small sample of it. This will give you an idea of the data structure and help you decide what to do next.

PATH = '/home/jmanu/git/amex_competition/train_data.csv'
data = pd.read_csv(PATH, nrows=35000)
data.head(5)
   customer_ID                                         S_2         P_2       D_39      B_1       ...
0  0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f...  2017-03-09  0.938469  0.001733  0.008724  ...
1  0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f...  2017-04-07  0.936665  0.005775  0.004923  ...
2  0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f...  2017-05-28  0.954180  0.091505  0.021655  ...
3  0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f...  2017-06-13  0.960384  0.002455  0.013683  ...
4  0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f...  2017-07-16  0.947248  0.002483  0.015193  ...

[5 rows x 190 columns]

Using the info() method, you can get a better sense of how the dataset is structured.

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35000 entries, 0 to 34999
Data columns (total 190 columns):
 #    Column       Non-Null Count  Dtype  
---   ------       --------------  -----  
 0    customer_ID  35000 non-null  object 
 1    S_2          35000 non-null  object 
 2    P_2          34690 non-null  float64
 3    D_39         35000 non-null  float64
 4    B_1          35000 non-null  float64
 5    B_2          34994 non-null  float64
 6    R_1          35000 non-null  float64
 7    S_3          28053 non-null  float64
 8    D_41         34994 non-null  float64
 9    B_3          34994 non-null  float64
 10   D_42         4889 non-null   float64
 11   D_43         24217 non-null  float64
 12   D_44         33279 non-null  float64
 13   B_4          35000 non-null  float64
 14   D_45         34994 non-null  float64
 15   B_5          35000 non-null  float64
 16   R_2          35000 non-null  float64
 17   D_46         27411 non-null  float64
 18   D_47         35000 non-null  float64
 19   D_48         30448 non-null  float64
 20   D_49         3808 non-null   float64
 21   B_6          34999 non-null  float64
 22   B_7          35000 non-null  float64
 23   B_8          34833 non-null  float64
 24   D_50         15101 non-null  float64
 25   D_51         35000 non-null  float64
 26   B_9          35000 non-null  float64
 27   R_3          35000 non-null  float64
 28   D_52         34832 non-null  float64
 29   P_3          33203 non-null  float64
 30   B_10         35000 non-null  float64
 31   D_53         9055 non-null   float64
 32   S_5          35000 non-null  float64
 33   B_11         35000 non-null  float64
 34   S_6          35000 non-null  float64
 35   D_54         34994 non-null  float64
 36   R_4          35000 non-null  float64
 37   S_7          28053 non-null  float64
 38   B_12         35000 non-null  float64
 39   S_8          35000 non-null  float64
 40   D_55         33844 non-null  float64
 41   D_56         16107 non-null  float64
 42   B_13         34687 non-null  float64
 43   R_5          35000 non-null  float64
 44   D_58         35000 non-null  float64
 45   S_9          16298 non-null  float64
 46   B_14         35000 non-null  float64
 47   D_59         34374 non-null  float64
 48   D_60         35000 non-null  float64
 49   D_61         31232 non-null  float64
 50   B_15         34974 non-null  float64
 51   S_11         35000 non-null  float64
 52   D_62         30170 non-null  float64
 53   D_63         35000 non-null  object 
 54   D_64         33718 non-null  object 
 55   D_65         35000 non-null  float64
 56   B_16         34994 non-null  float64
 57   B_17         15208 non-null  float64
 58   B_18         35000 non-null  float64
 59   B_19         34994 non-null  float64
 60   D_66         3888 non-null   float64
 61   B_20         34994 non-null  float64
 62   D_68         33738 non-null  float64
 63   S_12         35000 non-null  float64
 64   R_6          35000 non-null  float64
 65   S_13         35000 non-null  float64
 66   B_21         35000 non-null  float64
 67   D_69         33846 non-null  float64
 68   B_22         34994 non-null  float64
 69   D_70         34440 non-null  float64
 70   D_71         35000 non-null  float64
 71   D_72         34866 non-null  float64
 72   S_15         35000 non-null  float64
 73   B_23         35000 non-null  float64
 74   D_73         362 non-null    float64
 75   P_4          35000 non-null  float64
 76   D_74         34901 non-null  float64
 77   D_75         35000 non-null  float64
 78   D_76         4140 non-null   float64
 79   B_24         35000 non-null  float64
 80   R_7          35000 non-null  float64
 81   D_77         18900 non-null  float64
 82   B_25         34974 non-null  float64
 83   B_26         34994 non-null  float64
 84   D_78         33279 non-null  float64
 85   D_79         34582 non-null  float64
 86   R_8          35000 non-null  float64
 87   R_9          2087 non-null   float64
 88   S_16         35000 non-null  float64
 89   D_80         34901 non-null  float64
 90   R_10         35000 non-null  float64
 91   R_11         35000 non-null  float64
 92   B_27         34994 non-null  float64
 93   D_81         34844 non-null  float64
 94   D_82         9126 non-null   float64
 95   S_17         35000 non-null  float64
 96   R_12         35000 non-null  float64
 97   B_28         35000 non-null  float64
 98   R_13         35000 non-null  float64
 99   D_83         33846 non-null  float64
 100  R_14         35000 non-null  float64
 101  R_15         35000 non-null  float64
 102  D_84         34832 non-null  float64
 103  R_16         35000 non-null  float64
 104  B_29         2401 non-null   float64
 105  B_30         34994 non-null  float64
 106  S_18         35000 non-null  float64
 107  D_86         35000 non-null  float64
 108  D_87         35 non-null     float64
 109  R_17         35000 non-null  float64
 110  R_18         35000 non-null  float64
 111  D_88         45 non-null     float64
 112  B_31         35000 non-null  int64  
 113  S_19         35000 non-null  float64
 114  R_19         35000 non-null  float64
 115  B_32         35000 non-null  float64
 116  S_20         35000 non-null  float64
 117  R_20         34999 non-null  float64
 118  R_21         35000 non-null  float64
 119  B_33         34994 non-null  float64
 120  D_89         34832 non-null  float64
 121  R_22         35000 non-null  float64
 122  R_23         35000 non-null  float64
 123  D_91         34066 non-null  float64
 124  D_92         35000 non-null  float64
 125  D_93         35000 non-null  float64
 126  D_94         35000 non-null  float64
 127  R_24         35000 non-null  float64
 128  R_25         35000 non-null  float64
 129  D_96         35000 non-null  float64
 130  S_22         34872 non-null  float64
 131  S_23         34998 non-null  float64
 132  S_24         34874 non-null  float64
 133  S_25         34917 non-null  float64
 134  S_26         34996 non-null  float64
 135  D_102        34751 non-null  float64
 136  D_103        34403 non-null  float64
 137  D_104        34403 non-null  float64
 138  D_105        15816 non-null  float64
 139  D_106        3771 non-null   float64
 140  D_107        34403 non-null  float64
 141  B_36         35000 non-null  float64
 142  B_37         35000 non-null  float64
 143  R_26         3717 non-null   float64
 144  R_27         34198 non-null  float64
 145  B_38         34994 non-null  float64
 146  D_108        202 non-null    float64
 147  D_109        34991 non-null  float64
 148  D_110        180 non-null    float64
 149  D_111        180 non-null    float64
 150  B_39         188 non-null    float64
 151  D_112        34990 non-null  float64
 152  B_40         35000 non-null  float64
 153  S_27         25742 non-null  float64
 154  D_113        33955 non-null  float64
 155  D_114        33955 non-null  float64
 156  D_115        33955 non-null  float64
 157  D_116        33955 non-null  float64
 158  D_117        33955 non-null  float64
 159  D_118        33955 non-null  float64
 160  D_119        33955 non-null  float64
 161  D_120        33955 non-null  float64
 162  D_121        33955 non-null  float64
 163  D_122        33955 non-null  float64
 164  D_123        33955 non-null  float64
 165  D_124        33955 non-null  float64
 166  D_125        33955 non-null  float64
 167  D_126        34310 non-null  float64
 168  D_127        35000 non-null  float64
 169  D_128        34403 non-null  float64
 170  D_129        34403 non-null  float64
 171  B_41         34996 non-null  float64
 172  B_42         498 non-null    float64
 173  D_130        34403 non-null  float64
 174  D_131        34403 non-null  float64
 175  D_132        3781 non-null   float64
 176  D_133        34739 non-null  float64
 177  R_28         35000 non-null  float64
 178  D_134        1317 non-null   float64
 179  D_135        1317 non-null   float64
 180  D_136        1317 non-null   float64
 181  D_137        1317 non-null   float64
 182  D_138        1317 non-null   float64
 183  D_139        34403 non-null  float64
 184  D_140        34751 non-null  float64
 185  D_141        34403 non-null  float64
 186  D_142        5945 non-null   float64
 187  D_143        34403 non-null  float64
 188  D_144        34751 non-null  float64
 189  D_145        34403 non-null  float64
dtypes: float64(185), int64(1), object(4)
memory usage: 50.7+ MB
print(f'Number of data types object cols: {data.select_dtypes("object").columns.size}')
print(f'Number of data types int cols: {data.select_dtypes("int64").columns.size}')
print(f'Number of data types float cols: {data.select_dtypes("float64").columns.size}')
Number of data types object cols: 4
Number of data types int cols: 1
Number of data types float cols: 185
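
Note that info() reports “50.7+ MB” with a plus sign because, by default, it does not measure the contents of object columns. A deep measurement (a quick check; the same figure is reused for the chunk-size estimate later) comes out higher:

data.memory_usage(deep=True).sum() / 1024**2  # deep=True also counts the string payloads; ~59.8 MB for this sample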

Data preprocessing

customer_ID hurts your eyes!

When you execute the head() method on the dataframe, the customer_ID column hurts your eyes. It is a column that stores strings that appear to be hexadecimal. This type of string typically represents binary data in a human-readable format, using the characters 0-9 and a-f (or A-F) to represent the values 0 through 15 in each digit. Given its length (64 hexadecimal characters), each value is likely a SHA-256 hash.
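
A quick check on the sample (a small sketch; the expected results are noted in the comments) confirms the fixed 64-character length:

print(data['customer_ID'].str.len().unique())  # expect: [64]
print(data['customer_ID'].nunique(), 'unique customers in the sample')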

This means that each ID alone uses 64 bytes of memory per row. So, following Chris Deotte’s advice here and here, we take the last 16 digits of the hexadecimal string and convert them to a 64-bit integer.

data['customer_ID'] = data['customer_ID'].apply(lambda x: int(x[-16:], 16)).astype('uint64')  # 16 hex digits need 64 bits; 'int32' would silently truncate the hash
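
As a rough sanity check (a sketch on a fresh raw read of the same sample; exact numbers will vary), you can compare the column’s footprint before and after the conversion:

ids_raw = pd.read_csv(PATH, nrows=35000, usecols=['customer_ID'])['customer_ID']
mb_before = ids_raw.memory_usage(deep=True) / 1024**2
mb_after = (ids_raw.apply(lambda x: int(x[-16:], 16))
                   .astype('uint64')
                   .memory_usage(deep=True) / 1024**2)
print(f'object: {mb_before:.1f} MB -> uint64: {mb_after:.1f} MB')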

From 11 categorical columns to 13 categorical columns

American Express has also provided a list of categorical features that we can convert to the categorical type in Pandas. This will reduce the memory usage of these columns. The list is below:

['B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120', 'D_126', 'D_63', 'D_64', 'D_66', 'D_68']
cat_cols = ['B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120', 'D_126', 'D_63', 'D_64', 'D_66', 'D_68']
data[cat_cols].head(5)
B_30 B_38 D_114 D_116 D_117 D_120 D_126 D_63 D_64 D_66 D_68
0 0.0 2.0 1.0 0.0 4.0 0.0 1.0 CR O NaN 6.0
1 0.0 2.0 1.0 0.0 4.0 0.0 1.0 CR O NaN 6.0
2 0.0 2.0 1.0 0.0 4.0 0.0 1.0 CR O NaN 6.0
3 0.0 2.0 1.0 0.0 4.0 0.0 1.0 CR O NaN 6.0
4 0.0 2.0 1.0 0.0 4.0 0.0 1.0 CR O NaN 6.0
data[cat_cols].nunique()
B_30     3
B_38     7
D_114    2
D_116    2
D_117    7
D_120    2
D_126    3
D_63     6
D_64     4
D_66     2
D_68     7
dtype: int64

To convert these columns to the categorical type and then replace them with their integer codes, you can use the following code:

data[cat_cols] = data[cat_cols].astype('category')
for cat in cat_cols:
    data[cat] = data[cat].cat.codes
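
One caveat for later, when this is applied chunk by chunk: cat.codes depends on which category values happen to appear in a given chunk, so the codes could in principle differ between chunks. A defensive sketch (known_categories and encode_categories are hypothetical names, and it assumes the sample covers every category value) pins the category lists up front:

# Hypothetical sketch: derive each column's categories from a raw sample,
# then reuse them for every chunk so the integer codes stay consistent
sample = pd.read_csv(PATH, nrows=35000, usecols=cat_cols)
known_categories = {col: sorted(sample[col].dropna().unique()) for col in cat_cols}

def encode_categories(df, categories):
    for col, cats in categories.items():
        # unseen values and NaN both map to code -1
        df[col] = pd.Categorical(df[col], categories=cats).codes.astype('int8')
    return df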

Among the numerical columns, we also discover some that are effectively categorical. We can convert them to small integer types in Pandas, which will reduce their memory usage. Since integer dtypes cannot represent NaN, missing values are first filled with the sentinel -1.

for col in data.select_dtypes(include='number'):
    if data[col].nunique() < 3:
        print('Column with fewer than 3 unique values:', data[col].name, data[col].dtype)
Column with fewer than 3 unique values: D_87 float64
Column with fewer than 3 unique values: B_31 int64
data[['D_87', 'B_31']].value_counts()
D_87  B_31
1.0   1       29
      0        6
Name: count, dtype: int64
for col in ['D_87', 'B_31']:
    data[col] = data[col].fillna(-1)
    data[col] = data[col].astype('int8')

Convert S_2 to a datetime data type

The S_2 column is a date column that is stored as a string. We can convert it to a datetime data type in Pandas using the following code:

data['S_2'] = pd.to_datetime(data['S_2'], format='%Y-%m-%d')

This will reduce the memory usage of this column, but we can go further and split the date into year, month, and day columns. This lets us analyze the data by year, month, and day, and reduces the memory usage even further.

# Split into separate columns
data['S_2_year'] = (data['S_2'].dt.year % 100).astype('int8')
data['S_2_month'] = data['S_2'].dt.month.astype('int8')
data['S_2_day'] = data['S_2'].dt.day.astype('int8')
del data['S_2']

185 float64 columns to float32 or float16

The first thing you notice when you inspect the info() output is that we have 185 float64 columns, each element consuming 64 bits (8 bytes). This data type represents double-precision floating-point numbers, used for values with very large or very small magnitudes and/or a significant number of digits of precision. As we are going to see later, we can reduce the memory usage by changing these columns to float32 or even float16.

len(data.select_dtypes(include=np.float64).columns)
175

After all the transformations so far, the count has dropped to 175 float64 columns. Converting them to float32 halves their memory usage, without a significant loss of precision for many applications; converting to float16 reduces it by 75%.

As Chris Deotte pointed out in the notebook mentioned above, these columns can be pushed all the way down to float16 because uniform noise was added to the data, so it is not as precise as it seems. The noise was likely injected for privacy protection.

for col in data.select_dtypes(include=np.float64).columns:
    data[col] = data[col].fillna(-1).astype('float16')
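
To get a feel for how much precision float16 keeps (a toy illustration using a value from the sample above, not part of the pipeline):

x = np.float64(0.938469)
print(np.float32(x))  # 0.938469 -- float32 keeps ~7 significant digits
print(np.float16(x))  # 0.9385   -- float16 keeps ~3, still enough given the injected noise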

Chunk strategy

Once we’ve thoroughly understood our dataset and identified all necessary transformations to make it more manageable, the next step is to apply these transformations to the entire dataset. To avoid memory issues, we will process the data in chunks, similar to how batches are used in deep learning. Therefore, it’s essential to develop a strategy for determining the sizes of these chunks. We will calculate the memory consumption per row based on our original sample data, allowing us to optimize the chunk size effectively.

Memory consumption for 35,000 rows: 59.81 MB

data = pd.read_csv(PATH, nrows=35000)
data.memory_usage(deep=True).sum() / 1024**2
59.81640434265137

Memory usage per row is: \(\frac{59.81}{35000}\approx 0.0017\) MB

(data.memory_usage(deep=True).sum() / 1024**2)/data.shape[0]
0.0017090401240757534

To keep enough headroom, we are going to use only half of the available memory.

Target memory usage, approximately: \(\frac{8.30}{2}\approx 4.15\) GB

available_memory_gb() / 2 
4.147722244262695

Chunk size in rows (converting the GB budget to MB first): \(\frac{4.15 \times 1000}{0.0017}\approx 2431508\) rows

(available_memory_gb() / 2) * 1000 / ((data.memory_usage(deep=True).sum() / 1024**2) / data.shape[0])
2431507.734935815

A more conservative approach is to scale this estimate down by a factor of ten to ensure smooth processing.

2431507.734935815 * 0.1
243150.7734935815

Our final chunk size will be 243151 rows.
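
The whole estimate can be wrapped in a small helper (a sketch; estimate_chunk_size is a hypothetical name whose defaults mirror the choices above):

def estimate_chunk_size(sample_df, memory_fraction=0.5, safety_factor=0.1):
    # memory consumed per row, in MB, measured on the sample
    mb_per_row = sample_df.memory_usage(deep=True).sum() / 1024**2 / len(sample_df)
    # budget in MB: a fraction of whatever memory is currently available
    budget_mb = available_memory_gb() * memory_fraction * 1000
    return int(budget_mb * safety_factor / mb_per_row)

With the defaults, this reproduces the ~243151-row figure above (modulo whatever memory happens to be free when it runs).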

del data
gc.collect()
1194

Introduction to PyArrow and Parquet

PyArrow is a Python library that provides a bridge between Python and the Apache Arrow in-memory data format.

Apache Arrow is designed to improve the performance and efficiency of data processing by providing a standardized in-memory columnar format that can be shared across different systems and languages. PyArrow includes support for reading and writing Parquet files.

import pyarrow.parquet as pq -> imports PyArrow’s Parquet module, which provides tools for reading and writing Parquet files.

import pyarrow as pa -> imports the core PyArrow library, which gives access to data structures like Tables, used to interface with Parquet files and other Arrow functionality.

Once the libraries have been imported, we define the path for the output Parquet file. We also initialize the writer variable to None; it will later hold a ParquetWriter object used to write the data to the Parquet file.

output_path = 'train.parquet'
writer = None

Based on our memory estimate, we set the chunksize to 243151. This controls how much data is read into memory at once. We can then iterate over the chunks of the DataFrame and write them to the Parquet file.

Next, we can start applying all the transformations to the data.

chunksize = 243151
for chunk in pd.read_csv(PATH, chunksize=chunksize, engine='c'):
    # Transforming customer_ID
    chunk['customer_ID'] = chunk['customer_ID'].apply(lambda x: int(x[-16:], 16)).astype('uint64')  # 64-bit value; 'int32' would truncate
    # Apply datetime transformations 
    chunk['S_2'] = pd.to_datetime(chunk['S_2'], format='%Y-%m-%d')
    chunk['year_last_2_digits'] = chunk['S_2'].dt.year % 100
    chunk['month'] = chunk['S_2'].dt.month
    chunk['day'] = chunk['S_2'].dt.day
    chunk[['year_last_2_digits', 'month', 'day']] = chunk[['year_last_2_digits', 'month', 'day']].astype('int8')
    # Apply category transformations
    chunk[cat_cols] = chunk[cat_cols].astype('category')
    for cat in cat_cols:
        chunk[cat] = chunk[cat].cat.codes.astype('int8')
    # Additional columns to convert to 'int8'
    for col in ['D_87', 'B_31']:
        chunk[col] = chunk[col].fillna(-1).astype('int8')
        
    # Convert all remaining float64 columns to float16
    for col in chunk.select_dtypes(include=np.float64).columns:
        chunk[col] = chunk[col].fillna(-1).astype('float16')

Inside the loop, we need to convert the chunk DataFrame into a PyArrow Table using the pa.Table.from_pandas() method. This method takes the DataFrame as input and returns a Table object that can be written to a Parquet file. (As an aside, a default RangeIndex is stored as schema metadata rather than as a physical column, so the schema stays consistent across chunks; from_pandas() also accepts preserve_index=False if you want to drop the index outright.)

# Convert DataFrame to PyArrow Table
table = pa.Table.from_pandas(chunk)

Now we need to initialize the ParquetWriter object if it has not been created yet. We do this by checking whether the writer variable is still None; if it is, we create a ParquetWriter via pq.ParquetWriter(), passing the output path and the schema taken from the first chunk’s table.

if writer is None:
    writer = pq.ParquetWriter(output_path, table.schema, compression='snappy')

Finally, the next line writes the PyArrow Table to the Parquet file using the writer.write_table() method. This is done incrementally for each chunk of data read from the CSV file.

writer.write_table(table)

After the loop is finished, we need to close the ParquetWriter object using the writer.close() method. This will finalize the writing process and save the data to the Parquet file.

if writer:
    writer.close()

This is a simple example of how to convert a large CSV file to a Parquet file using PyArrow. This process can be customized further based on the specific requirements of the data and the desired output format.
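
One refinement worth considering (a sketch, not part of the run below): if an exception interrupts the loop, the writer stays open and the file may be left unfinalized. Wrapping the loop in try/finally guards against that:

writer = None
try:
    for chunk in pd.read_csv(PATH, chunksize=chunksize, engine='c'):
        # ... same per-chunk transformations as above ...
        table = pa.Table.from_pandas(chunk)
        if writer is None:
            writer = pq.ParquetWriter(output_path, table.schema, compression='snappy')
        writer.write_table(table)
finally:
    if writer is not None:
        writer.close()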

Running the whole code

Let’s run the whole code to get our final train set in Parquet format.

# Define the path for the output Parquet file
output_path = 'train.parquet'
# Initialize PyArrow Parquet writer, initially without a schema
writer = None

# Read the CSV file in chunks
chunksize = 243151  # Tuned to the available memory and dataset size (see the estimate above)

# Cat columns 
cat_cols = ['B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120', 'D_126', 'D_63', 'D_64', 'D_66', 'D_68']

for chunk in pd.read_csv(PATH, chunksize=chunksize, na_values=-1, engine='c'):
    # Transforming customer_ID
    chunk['customer_ID'] = chunk['customer_ID'].apply(lambda x: int(x[-16:], 16)).astype('uint64')  # 64-bit value; 'int32' would truncate
    # Apply datetime transformations 
    chunk['S_2'] = pd.to_datetime(chunk['S_2'], format='%Y-%m-%d')
    chunk['year_last_2_digits'] = chunk['S_2'].dt.year % 100
    chunk['month'] = chunk['S_2'].dt.month
    chunk['day'] = chunk['S_2'].dt.day
    chunk[['year_last_2_digits', 'month', 'day']] = chunk[['year_last_2_digits', 'month', 'day']].astype('int8')
    # Apply category transformations
    # (note: computing categories per chunk assumes every chunk contains the
    #  same set of values; see the pinned-categories sketch earlier for a safer option)
    chunk[cat_cols] = chunk[cat_cols].astype('category')
    for cat in cat_cols:
        chunk[cat] = chunk[cat].cat.codes.astype('int8')
    # Additional columns to convert to 'int8'
    for col in ['D_87', 'B_31']:
        chunk[col] = chunk[col].fillna(-1).astype('int8')
        
    # Convert all remaining float64 columns to float16
    for col in chunk.select_dtypes(include=np.float64).columns:
        chunk[col] = chunk[col].fillna(-1).astype('float16')
    
    # Convert DataFrame to PyArrow Table
    table = pa.Table.from_pandas(chunk)
    
    # Initialize writer with schema from the first chunk
    if writer is None:
        writer = pq.ParquetWriter(output_path, table.schema, compression='snappy')
    
    # Write table to Parquet file
    writer.write_table(table)

# Don't forget to close the writer to finalize the Parquet file
if writer:
    writer.close()
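
Finally, as a quick sanity check (a sketch; the batch size is reused from above), the resulting file can be inspected and read back in batches, so the 50GB CSV never has to be loaded whole again:

pf = pq.ParquetFile(output_path)
print(pf.metadata.num_rows, 'rows in', pf.metadata.num_row_groups, 'row groups')

for batch in pf.iter_batches(batch_size=243151):
    df_batch = batch.to_pandas()
    # ... process each batch here ...
    break  # just peek at the first batch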