Daily Note - 16-05-2024

quarto
bash
python
debugging
AMEX
pandas
Author

JM Ascacibar

Published

May 16, 2024

Numerical data types in Python, datetime in pandas, the modulus operator %, pd.options.display.max_info_columns, memory_usage(), save order, Parquet and Feather, the psutil library, managing memory in notebooks


1. Difference between int32 and int64 in Python

In Python libraries like pandas and numpy, we can use int32 and int64 to represent integers. The difference between them is the amount of memory they use. int32 uses 32 bits (4 bytes) to represent an integer, while int64 uses 64 bits (8 bytes). This means that int64 can represent larger numbers than int32. For example, int32 can represent numbers from -2,147,483,648 to 2,147,483,647, while int64 can represent numbers from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807.
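A quick numpy sketch of the memory difference (the array contents are arbitrary):

```python
import numpy as np

# One million integers stored as int64 vs int32.
a64 = np.arange(1_000_000, dtype=np.int64)
a32 = a64.astype(np.int32)  # safe here because every value fits in 32 bits

print(a64.nbytes)  # 8000000 bytes (8 bytes per element)
print(a32.nbytes)  # 4000000 bytes (4 bytes per element)
```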

2. Float Numbers data types in Python

In Python, there are several data types to represent numbers; one of them is the float, used to represent real numbers with decimal points. Python's built-in float is a 64-bit double, while numpy and pandas additionally provide narrower types such as float32 and float16.

A float64 (or double precision) stores numbers with approximately 15-17 decimal digits of precision and requires 8 bytes per number. A float32 (or single precision) stores numbers with about 6-9 decimal digits of precision and requires only 4 bytes per number.

Converting from float64 to float32 can save half the memory usage without a significant loss in precision for many applications, although the exact impact depends on the specific data and requirements.

A float16 (or half precision) provides even less precision, about 3-4 decimal digits, and requires only 2 bytes per number.
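A small sketch of the size/precision trade-off across the three float widths (the sample value is arbitrary):

```python
import numpy as np

x64 = np.array([0.123456789012345], dtype=np.float64)
x32 = x64.astype(np.float32)  # roughly 7 significant digits survive
x16 = x64.astype(np.float16)  # roughly 3-4 significant digits survive

print(x64.itemsize, x32.itemsize, x16.itemsize)  # 8 4 2 bytes per number
```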

3. Convert year, month, day to int8

When working with date columns in pandas, it is common to split them into year, month, and day components and downcast those to small integer types to save memory. The int8 data type can store integers from -128 to 127, which is enough for the month and day, and for a two-digit year (e.g. year % 100); the full four-digit year needs at least int16.

int8 occupies only 1 byte of memory per entry, whereas int32 uses 4 bytes and int64 uses 8 bytes. This difference becomes significant when dealing with large datasets.
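A minimal sketch of this downcasting (the dates and column names are hypothetical):

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2024-05-16", "2023-12-31"]))
df = pd.DataFrame({
    "year": (dates.dt.year % 100).astype("int8"),  # two-digit year fits in int8
    "month": dates.dt.month.astype("int8"),
    "day": dates.dt.day.astype("int8"),
})

print(df.memory_usage(index=False).sum())  # 3 columns * 2 rows * 1 byte = 6 bytes
```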

4. Remainder and modulus operator %

The remainder is the amount left over after performing a division operation between two numbers. For example, when you divide 17 by 5, the quotient is 3 and the remainder is 2.

The modulus operator % returns the remainder of dividing one number by another. It is often used to determine whether a number is even or odd, or to extract trailing digits, such as the last two digits of a year. In pandas, you can extract them with:

data['S_2'].dt.year % 100

Where data['S_2'] is a datetime column and dt.year extracts the year from the datetime column.
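For example, with a hypothetical S_2 datetime column:

```python
import pandas as pd

print(17 % 5)  # 2: the remainder of 17 divided by 5

data = pd.DataFrame({"S_2": pd.to_datetime(["2017-03-01", "2018-11-30"])})
print(data["S_2"].dt.year % 100)  # 17 and 18: the last two digits of each year
```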

5. pd.options.display.max_info_columns

pd.options.display.max_info_columns is a pandas option that controls the maximum number of columns displayed when using the df.info() method. It is useful when working with large datasets with many columns.

pd.options.display.max_info_columns = 300

6. memory_usage() method in pandas

The memory_usage() method in pandas is used to calculate the memory usage of a DataFrame. By default, it returns the memory usage of each column in bytes. You can pass the deep=True argument to introspect the data deeply by interrogating object dtypes for system-level memory consumption (see the pandas documentation for details).

data.memory_usage(deep=True)['customer_ID']
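A sketch of the difference deep=True makes on a string column (the DataFrame is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"customer_ID": ["C_0001"] * 1_000})

shallow = df.memory_usage()["customer_ID"]        # counts only the 8-byte pointers
deep = df.memory_usage(deep=True)["customer_ID"]  # also counts the string objects

print(shallow, deep)  # deep is considerably larger for object dtype
```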

7. Save order, Parquet and Feather

In the cases where managing large datasets is critical, choosing the right file format based on save order and compression can be important.

Save Order

The save order refers to how data is physically stored in a file. It can be either row-oriented or column-oriented. Row-oriented storage is when data is stored row by row, while column-oriented storage is when data is stored column by column.

CSV files save data in a row-wise manner. Each row is written sequentially, and when reading the file, it is typically read row by row. If your analysis or processing often requires accessing complete rows at a time, then CSV might be suitable. However, if you only need to access specific columns, CSV is inefficient because it loads entire rows into memory. Also keep in mind that CSV files don't support compression natively and don't preserve data types.

Parquet and Feather

Parquet and Feather are designed to store data column by column. This format is particularly beneficial for analytical processing where queries often involve specific columns across a wide range of rows. Both formats support compression and preserve data types.

So if your data access is mostly columnar, then use Parquet or Feather. Use Parquet if you need efficient storage and excellent compression, or use Feather if you need fast read and write times.

For the AMEX competition, where datasets are typically large, efficiency is crucial. Parquet is often the preferred choice due to its performance benefits in terms of storage, partial reads, and data type preservation.

8. psutil library

psutil is a Python library that provides an interface for retrieving information on running processes and system utilization. It can be used to monitor system resources like CPU, memory, disk, and network usage (see the psutil documentation for details).

I’ve created a function to check the available memory in the system (in GiB):

import psutil

def available_memory_gb():
    return psutil.virtual_memory().available / (1024**3)

9. Managing memory with notebooks

When working with large datasets in Jupyter notebooks, it is important to manage memory efficiently to avoid running out of memory.

In Python, memory management is primarily handled by the garbage collector, which automatically frees up memory when objects are no longer needed. However, in some cases, especially when working with large data structures, it can be beneficial to intervene manually to ensure memory is freed up more promptly.

Here are some tips to manage memory in notebooks:

Use del to delete variables

When you no longer need a variable, use the del statement to delete it from memory. This will free up memory that can be used for other operations.

import gc

del variable_name  # Delete the variable to free up memory
gc.collect()  # Explicitly call garbage collector

Restart the kernel

If you are running out of memory, restarting the kernel can help free up memory. This will clear all variables and objects from memory, allowing you to start fresh.

Monitor System Memory

You can use system tools like the task manager on Windows, or top and htop on Linux to monitor system memory usage.

Monitor GPU use (NVIDIA GPUs)

The most straightforward way to monitor GPU usage is to use the nvidia-smi command in the terminal. This command provides real-time information about GPU utilization, memory usage, and temperature.

You can also run it in a continuous monitoring mode by running nvidia-smi -l 1 in the terminal.

For deep learning model training, it is quite common to use nvidia-smi dmon to monitor GPU usage.
