Daily Note - 15-05-2024

AMEX
bash
python
mamba
pandas
Author

JM Ascacibar

Published

May 15, 2024

String encoding, bits and bytes, Hexadecimal strings, installing mamba, installing dask, wc -l, random.sample(), sorted() and sort(), pd.options


1. String encoding

String encoding is the process of converting a string of characters into a sequence of bytes. When a string is stored or transmitted, it is encoded into bytes; when it is read or received, it is decoded back into a string. Using the same encoding on both sides keeps the text consistent and interpretable across different systems.
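
As a small illustration of the round trip described above, here is a sketch using Python's built-in str.encode() and bytes.decode(). The sample word "café" is just an example chosen for this note.

```python
text = "café"

encoded = text.encode("utf-8")      # str -> bytes
decoded = encoded.decode("utf-8")   # bytes -> str

print(encoded)                      # b'caf\xc3\xa9'
print(len(text), len(encoded))      # 4 5 (the "é" takes 2 bytes in UTF-8)
print(decoded == text)              # True
```

Decoding with a different encoding than the one used for encoding is what typically produces garbled text.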

2. Units of digital information: bits and bytes

Bits and bytes are the basic units of digital information. A bit is the smallest unit of data in computing: a binary digit that can have a value of 0 or 1. Bits are used to represent binary data.

A byte is a larger unit of data, typically consisting of 8 bits. It’s used to represent characters, numbers, and other data. Bytes are the standard unit for measuring data storage and transmission.

1 byte = 8 bits, which means that a byte can represent 256 different values (2^8).

For example, the character ‘A’ is represented by the byte 01000001 in ASCII encoding. This is an 8-bit sequence, so it is one byte.
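
The numbers above can be checked quickly in Python. This is only a sketch using built-ins (ord, format, int, chr); the variable names are mine.

```python
n_values = 2 ** 8                    # distinct values one byte can hold
bits = format(ord("A"), "08b")       # code point of 'A' as 8 binary digits
back = chr(int(bits, 2))             # parse the bits back into a character

print(n_values)   # 256
print(bits)       # 01000001
print(back)       # A
```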

3. Common Encodings

  • ASCII: American Standard Code for Information Interchange. It is a character encoding standard for electronic communication. Standard ASCII uses 7 bits per character and can represent 128 characters; the 8-bit extended versions (Extended ASCII) can represent 256 characters.

  • UTF-8 (Unicode Transformation Format): It is a variable-width character encoding standard that can represent every character in the Unicode character set. Each character is represented by 1 to 4 bytes. It is backward compatible with ASCII for the first 128 characters (1 byte each).

  • UTF-16: It is a variable-width character encoding standard that uses one or two 16-bit units (2 or 4 bytes) to represent each character. It can represent over a million characters.

  • UTF-32: It is a fixed-width character encoding standard that uses 32 bits (4 bytes) to represent every character in the Unicode standard.
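
To see the difference between these encodings, the sketch below counts how many bytes a few sample characters take in each one. The sample characters are my own choice, and I use the "-le" codec variants so that Python does not prepend a byte order mark to the output.

```python
samples = ["A", "é", "€", "😀"]
encodings = ("utf-8", "utf-16-le", "utf-32-le")

byte_lengths = {}
for ch in samples:
    # Number of bytes this single character needs in each encoding
    byte_lengths[ch] = {enc: len(ch.encode(enc)) for enc in encodings}
    print(ch, byte_lengths[ch])

# ASCII cannot represent characters outside its 128-character range
try:
    "é".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)   # False
```

UTF-8 grows from 1 to 4 bytes, UTF-16 uses 2 or 4, and UTF-32 always uses 4.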

4. Hexadecimal strings

Hexadecimal, often abbreviated as “hex,” is a base-16 numeral system used in mathematics and computing. It extends the decimal (base-10) system by adding six additional symbols, using digits 0-9 and letters A-F.

Hexadecimal Table

Decimal Hexadecimal
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 A
11 B
12 C
13 D
14 E
15 F
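
Python can convert between decimal, hexadecimal strings and bytes with built-ins. A short sketch (the values are arbitrary examples):

```python
as_hex = hex(255)                     # int -> hex string with '0x' prefix
as_int = int("ff", 16)                # hex string -> int
upper = format(10, "X")               # int -> uppercase hex digit
hex_text = bytes([1, 171]).hex()      # bytes -> hex string, 2 digits per byte
raw = bytes.fromhex("01ab")           # hex string -> bytes

print(as_hex)     # 0xff
print(as_int)     # 255
print(upper)      # A
print(hex_text)   # 01ab
print(raw)        # b'\x01\xab'
```

Each hex digit encodes 4 bits, so two hex digits describe exactly one byte.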

5. AMEX Competition customer_ID convert to int32 or int64

The customer_ID variable stores customer IDs as strings of length 64. Each customer ID therefore consists of 64 characters, which is equivalent to 64 bytes per string (1 byte per character).

If you convert these strings to a more compact numerical representation (such as an integer), you can significantly reduce the memory usage.

We know that int32 (32-bit integer) uses 4 bytes of memory (1 byte = 8 bits, 4 bytes = 32 bits) and int64 (64-bit integer) uses 8 bytes of memory (1 byte = 8 bits, 8 bytes = 64 bits).

So converting the customer_ID to int32 or int64 will reduce the memory usage per row to 4 bytes or 8 bytes respectively. This can be a significant reduction in memory usage, especially in this competition, where the data is 50GB.
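
A possible way to do the conversion is sketched below. It assumes the IDs are hexadecimal strings, and because 64 hex digits describe 256 bits (too many for an int64), it keeps only the last 16 hex digits. Both the truncation and the helper name customer_id_to_int64 are my assumptions, not something fixed by the competition, and the example ID is made up.

```python
def customer_id_to_int64(customer_id: str) -> int:
    """Map the last 16 hex digits of an ID to a signed 64-bit integer."""
    unsigned = int(customer_id[-16:], 16)        # 0 .. 2**64 - 1
    # Reinterpret as two's complement so the result fits the int64 range
    if unsigned >= 1 << 63:
        return unsigned - (1 << 64)
    return unsigned

example_id = "0" * 48 + "00000000000000ff"       # made-up 64-character ID

print(len(example_id))                   # 64
print(customer_id_to_int64(example_id))  # 255
print(customer_id_to_int64("f" * 64))    # -1
```

Truncating can in principle map two different IDs to the same integer, so it is worth comparing the number of unique values before and after the conversion.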

6. Abort any preprocessing in the AMEX competition

To save time, and because my computer cannot handle the size of the dataset, I’m going to abort any preprocessing and use the AMEX data - integer dtypes - parquet format dataset from @raddar, which is 4.94GB.

To download @raddar’s dataset:

kaggle datasets download -d raddar/amex-data-integer-dtypes-parquet-format

7. Installing mamba

Documentation

After messing around with .bashrc and having issues with the PATH of all my programs, I decided to install mamba.

  1. Install mamba (via the Miniforge installer, which includes it):
wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh
  2. Install Jupyter:
mamba install -c conda-forge jupyterlab
  3. Create alias for Jupyter Lab with no-browser:
nano ~/.bashrc
alias jlab='jupyter lab --no-browser'

Save and close the file. If you’re using nano, you can do this by pressing Ctrl+O, Enter, and then Ctrl+X to exit. Then run source ~/.bashrc (or open a new terminal) so the alias takes effect.

8. Installing dask

Documentation

mamba install dask -c conda-forge

9. Getting the number of rows from your csv file in the terminal

This can be done efficiently with wc -l. The -l flag tells wc to count the number of lines in the file. The count includes the header line, so the number of data rows is one less.

wc -l train_data.csv
5531452 train_data.csv
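
If you are already in Python, a similar count can be obtained by reading the file in binary chunks and counting newline bytes, which avoids loading the whole file into memory. This is a sketch: count_lines is a helper I made up, and the tiny CSV is written to a temporary folder just for the demonstration.

```python
import os
import tempfile

def count_lines(path, chunk_size=1 << 20):
    """Count newline characters in a file, reading 1 MiB at a time."""
    count = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            count += chunk.count(b"\n")
    return count

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tiny.csv")         # stand-in for train_data.csv
    with open(path, "w", newline="") as f:
        f.write("customer_ID,value\na,0.1\nb,0.2\n")
    n_lines = count_lines(path)
    n_rows = n_lines - 1                         # subtract the header line

print(n_lines, n_rows)   # 3 2
```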

10. Using random.sample() to get a random sample

The random.sample() function returns a random sample of items from a list or other sequence, sampling without replacement (each position in the sequence is picked at most once).

random.sample(population, k)

Here population is the sequence to sample from and k is the number of items to sample; k cannot be larger than the size of the population.

import random

items = [1, 2, 3, 4, 5]
sample = random.sample(items, 3)
print(sample)
[1, 5, 3]
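
The output above changes on every run. The sketch below shows two details that are easy to miss: setting a seed with random.seed() makes the sample repeatable, and asking for more items than the population holds raises a ValueError. The seed value 42 is arbitrary.

```python
import random

items = [1, 2, 3, 4, 5]

random.seed(42)                       # fix the seed for a repeatable sample
first = random.sample(items, 3)
random.seed(42)                       # same seed again -> same sample
second = random.sample(items, 3)
print(first == second)                # True

# The original list is left untouched
print(items)                          # [1, 2, 3, 4, 5]

# Sampling is without replacement, so k > len(items) is an error
try:
    random.sample(items, 10)
    too_large_failed = False
except ValueError:
    too_large_failed = True
print(too_large_failed)               # True
```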

11. Difference between sorted() and sort()

  • sorted(): This function returns a new sorted list from the elements of any iterable. It does not modify the original list.
  • sort(): This method sorts the list in place. It modifies the original list and returns None.
L = [1, 5, 4, 2, 3]
print(sorted(L), L)
print(L.sort(), L)
[1, 2, 3, 4, 5] [1, 5, 4, 2, 3]
None [1, 2, 3, 4, 5]
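
Since sorted() accepts any iterable, it also works on things that have no sort() method, such as tuples and strings. A short sketch with made-up data; the key and reverse arguments shown here are accepted by both sorted() and list.sort().

```python
words = ("pear", "fig", "banana")             # a tuple has no .sort() method

by_length = sorted(words, key=len)            # sort by word length
descending = sorted(words, reverse=True)      # reverse alphabetical order
letters = sorted("bca")                       # a string is iterable too

print(hasattr(words, "sort"))   # False
print(by_length)                # ['fig', 'pear', 'banana']
print(descending)               # ['pear', 'fig', 'banana']
print(letters)                  # ['a', 'b', 'c']
```

The result is always a new list, whatever the type of the input.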

12. I’m going to preprocess the data in the AMEX competition

After a break, and despite saying that I was going to use the data from @raddar, I decided to preprocess the data in the AMEX competition myself.

I’ve realised that I can read just the first few rows in order to see the data and the columns. This will help me understand the dataset.

import pandas as pd

data = pd.read_csv('train_data.csv', nrows=35000)

To select the object columns:

data.select_dtypes(include=['object'])
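
To show what select_dtypes returns, here is a sketch on a tiny made-up DataFrame (the column names and values are invented, not taken from the competition data). The string column is created with dtype=object explicitly so the example behaves the same across pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    "customer_ID": pd.Series(["a1", "b2"], dtype=object),  # strings as object
    "amount": [0.5, 0.7],
    "n_items": [1, 2],
})

object_cols = df.select_dtypes(include=["object"])
numeric_cols = df.select_dtypes(include=["number"])

print(list(object_cols.columns))    # ['customer_ID']
print(list(numeric_cols.columns))   # ['amount', 'n_items']
```

The object columns are usually the ones worth converting first, since strings take the most memory.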

13. pd.options.display.max_columns and pd.options.display.max_rows

These options allow you to control the maximum number of columns and rows displayed when printing a DataFrame.

pd.options.display.max_columns = 50
pd.options.display.max_rows = 50

This will display up to 50 columns and 50 rows when printing a DataFrame.
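
If the change should only apply to one piece of code, pandas also provides pd.option_context, which restores the previous values when the block ends. A minimal sketch:

```python
import pandas as pd

before = pd.get_option("display.max_rows")

# Options are changed only inside the with-block
with pd.option_context("display.max_columns", 50, "display.max_rows", 50):
    inside = pd.get_option("display.max_rows")

after = pd.get_option("display.max_rows")

print(inside)            # 50
print(after == before)   # True
```

pd.reset_option("display.max_rows") is another way to go back to the default after setting an option globally.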

Pandas documentation
