import random
items = [1, 2, 3, 4, 5]
sample = random.sample(items, 3)
print(sample)  # e.g. [1, 5, 3]
JM Ascacibar
May 15, 2024
String encoding, bits and bytes, Hexadecimal strings, installing mamba, installing dask, wc -l, random.sample(), sorted() and sort(), pd.options

String encoding is the process of converting a string of characters into a sequence of bytes. When a string is stored or transmitted, it is encoded into bytes; when it is read or received, it is decoded back into a string. Using the same encoding on both sides ensures that the text remains consistent and interpretable across different systems.
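As a minimal sketch of that round trip in Python, using the built-in str.encode() and bytes.decode() methods (the sample string is only an illustrative assumption):

```python
# Encode a string into bytes, then decode the bytes back into a string.
text = "caf\u00e9"  # "cafe" with an accented e; illustrative sample text

encoded = text.encode("utf-8")     # str -> bytes
decoded = encoded.decode("utf-8")  # bytes -> str

print(encoded)                  # b'caf\xc3\xa9'
print(decoded == text)          # True
print(len(text), len(encoded))  # 4 5
```

The string has 4 characters but needs 5 bytes, because the accented character takes 2 bytes in UTF-8.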
Bits and bytes are the basic units of digital information. A bit (binary digit) is the smallest unit of data in computing and can hold a value of 0 or 1.
A byte is a larger unit of data, typically consisting of 8 bits. It is used to represent characters, numbers, and other data, and it is the standard unit for measuring data storage and transmission.
Because 1 byte = 8 bits, a byte can represent 256 different values (2^8).
For example, the character ‘A’ is represented by the byte 01000001 in ASCII encoding. This is an 8-bit sequence, so it is exactly one byte.
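The ‘A’ example can be checked quickly in Python. This is a small sketch using only built-ins (ord(), format(), int() and chr()):

```python
# Show the 8-bit pattern behind the character 'A'.
code_point = ord("A")             # numeric value of the character
bits = format(code_point, "08b")  # zero-padded to 8 binary digits
back = chr(int(bits, 2))          # parse the bits and convert back to a character

print(code_point)  # 65
print(bits)        # 01000001
print(back)        # A
print(2 ** 8)      # 256 possible values in one byte
```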
ASCII (American Standard Code for Information Interchange): A character encoding standard for electronic communication. Standard ASCII uses 7 bits per character and can represent 128 characters; the 8-bit variants commonly called Extended ASCII can represent 256 characters.
UTF-8 (Unicode Transformation Format, 8-bit): A variable-width character encoding that can represent every character in the Unicode character set. Each character takes 1 to 4 bytes. It is backward compatible with ASCII: the first 128 characters use 1 byte each.
UTF-16: A variable-width character encoding that uses one or two 16-bit code units (2 or 4 bytes) per character. It can represent over a million characters.
UTF-32: A fixed-width character encoding that uses 32 bits (4 bytes) for every character in the Unicode standard.
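To make the width differences concrete, here is a small sketch that counts the bytes one character needs in each encoding. The four sample characters are my own choice for illustration.

```python
# Compare how many bytes one character needs in each encoding.
# The "-le" codec names avoid the byte-order mark that plain
# "utf-16" / "utf-32" would prepend.
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, e-acute, euro sign, emoji

byte_counts = {}
for ch in samples:
    byte_counts[ch] = [len(ch.encode(enc)) for enc in ENCODINGS]
    print(f"U+{ord(ch):04X}", byte_counts[ch])

# U+0041 [1, 2, 4]
# U+00E9 [2, 2, 4]
# U+20AC [3, 2, 4]
# U+1F600 [4, 4, 4]
```

UTF-8 grows from 1 to 4 bytes, UTF-16 switches from 2 to 4 bytes for the emoji, and UTF-32 always uses 4.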
Hexadecimal, often abbreviated as “hex”, is a base-16 numeral system used in mathematics and computing. It extends the decimal (base-10) system with six additional symbols, so it uses the digits 0-9 and the letters A-F.
| Decimal | Hexadecimal |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |
| 5 | 5 |
| 6 | 6 |
| 7 | 7 |
| 8 | 8 |
| 9 | 9 |
| 10 | A |
| 11 | B |
| 12 | C |
| 13 | D |
| 14 | E |
| 15 | F |
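The conversions in the table can be reproduced with Python built-ins. This is a short sketch; the values 255 and "41ff" are arbitrary examples.

```python
# Convert between decimal, hexadecimal strings and raw bytes.
as_hex = hex(255)                   # decimal -> hex string with "0x" prefix
from_hex = int("ff", 16)            # hex string -> decimal
upper_digit = format(10, "X")       # single uppercase hex digit, as in the table
hex_string = bytes([65, 255]).hex() # bytes -> hex string, two digits per byte
raw = bytes.fromhex("41ff")         # hex string -> bytes

print(as_hex)       # 0xff
print(from_hex)     # 255
print(upper_digit)  # A
print(hex_string)   # 41ff
print(raw)          # b'A\xff'
```

Two hex digits describe exactly one byte, which is why hex strings are a common way to write binary data such as hashes.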
customer_ID convert to int32 or int64
The customer_ID variable stores customer IDs as strings of length 64. Each customer ID therefore consists of 64 characters, which takes 64 bytes per string (1 byte per character).
If you convert these strings to a more compact numerical representation (such as an integer), you can significantly reduce the memory usage.
We know that int32 (32-bit integer) uses 4 bytes of memory (1 byte = 8 bits, 4 bytes = 32 bits) and int64 (64-bit integer) uses 8 bytes of memory (1 byte = 8 bits, 8 bytes = 64 bits).
So converting the customer_ID to int32 or int64 reduces the memory used by that column to 4 or 8 bytes per row, respectively. This can be a significant saving, especially in this competition, where the data is 50GB.
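A rough sketch of the arithmetic, using only the standard library. It assumes the IDs are replaced by integer codes through a lookup dictionary, which is just one possible approach and not necessarily how any published dataset was built. The IDs are made-up SHA-256 hex digests standing in for customer_ID.

```python
import hashlib
from array import array

# Hypothetical 64-character IDs (SHA-256 hex digests), standing in for customer_ID.
raw_ids = [hashlib.sha256(str(i).encode()).hexdigest() for i in range(1000)]
rows = raw_ids * 3  # pretend each customer appears on several rows

# Map each distinct string to a small integer code.
code_of = {cid: code for code, cid in enumerate(dict.fromkeys(rows))}
codes = array("q", (code_of[cid] for cid in rows))  # "q" = signed 64-bit integers

string_bytes = sum(len(cid.encode("ascii")) for cid in rows)  # 64 bytes per row
int_bytes = codes.itemsize * len(codes)                       # 8 bytes per row

print(string_bytes, int_bytes, string_bytes // int_bytes)  # 192000 24000 8
```

The lookup dictionary still has to be kept somewhere if the original IDs are needed later, but each distinct ID is stored only once instead of on every row.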
To save time, and because my computer cannot handle the size of the dataset, I’m going to abort any preprocessing and use the AMEX data - integer dtypes - parquet format dataset from @raddar, which is 4.94GB.
To download @raddar’s dataset:
kaggle datasets download -d raddar/amex-data-integer-dtypes-parquet-format
After messing around with .bashrc and having issues with the PATH of all my programs, I decided to install mamba.
mamba:
wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh
Jupyter:
mamba install -c conda-forge jupyterlab
nano ~/.bashrc
alias jlab='jupyter lab'
Save and close the file. If you’re using nano, you can do this by pressing Ctrl+O, Enter, and then Ctrl+X to exit.
mamba install dask -c conda-forge
The number of lines in a large file can be counted efficiently with a tool like wc -l. The -l flag tells wc to count the number of lines in the file.
wc -l train_data.csv
5531452 train_data.csv
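For environments without wc, a rough Python equivalent is to count newline characters while reading the file in chunks, so the whole file never has to fit in memory. The helper name count_lines and the temporary demo file are my own assumptions for illustration.

```python
import os
import tempfile

def count_lines(path, chunk_size=1024 * 1024):
    """Count newline characters in a file, reading it in binary chunks."""
    total = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total

# Demo on a small temporary file (hypothetical content, not the real train_data.csv).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("customer_ID,value\na,1\nb,2\n")
    demo_path = tmp.name

n_lines = count_lines(demo_path)
os.remove(demo_path)
print(n_lines)  # 3
```

Like wc -l, this counts every line, so if the CSV has a header row the number of data rows is one less than the result.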
random.sample() to get a random sample
The random.sample() function returns a random sample of items from a list or other sequence, chosen without replacement. It takes two arguments: the sequence to sample from and the number of items to sample.
random.sample(population, k)
Where population is the sequence to sample from and k is the number of items to sample.
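A small sketch of how random.sample() behaves. The seed 42 and the 100-item population are arbitrary choices for the demo, and a seeded random.Random instance is used only so that repeated runs pick the same items.

```python
import random

rng = random.Random(42)        # seeded generator, so the demo is repeatable
population = list(range(100))

picked = rng.sample(population, 5)

print(len(picked))                      # 5
print(len(set(picked)) == len(picked))  # True: no element is picked twice
print(population == list(range(100)))   # True: the input is left unchanged

# Asking for more items than the population holds raises ValueError.
try:
    rng.sample(population, 101)
    too_many_failed = False
except ValueError:
    too_many_failed = True
print(too_many_failed)                  # True
```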
sorted(): This function returns a new sorted list from the elements of any iterable. It does not modify the original.
sort(): This method sorts the list in place. It modifies the original list and returns None.
After a break, and despite having said that I was going to use the data from @raddar, I decided to preprocess the data in the AMEX competition myself.
I’ve realised that I can read just the first few rows to look at the data and the columns. This helps me understand the dataset.
data = pd.read_csv('train_data.csv', nrows=35000)
To select the object columns:
data.select_dtypes(include=['object'])
pd.options.display.max_columns and pd.options.display.max_rows
These options allow you to control the maximum number of columns and rows displayed when printing a DataFrame.
pd.options.display.max_columns = 50
pd.options.display.max_rows = 50
This will display up to 50 columns and 50 rows when printing a DataFrame.