Concept symmetric function, Order statistics, Quantiles, Python args syntax, Color palette in seaborn, standardization, plus or minus shortcut, not equal element-wise numpy function

Daily Note - 08/05/2024

1. Symmetric function

A symmetric function is a mathematical function that treats all of its inputs equally and produces the same output regardless of the order in which they are given. In other words, it is invariant under any permutation of the input variables.

Consider the following example:

\(f(x, y, z) = x + y + z\) is a symmetric function because \(f(x, y, z) = f(y, z, x)\): the order of the input variables does not matter. For example, \(f(1, 2, 3) = f(3, 2, 1) = 6\).
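As a quick sanity check, the property can be tested numerically by trying every ordering of one set of inputs. This is only a minimal sketch: the helper is_symmetric_on and the counterexample g are made up for illustration, and passing the check for one input does not prove symmetry in general.

```python
from itertools import permutations


def f(x, y, z):
    """The example function from the note: a plain sum of its inputs."""
    return x + y + z


def g(x, y, z):
    """A non-symmetric counterexample: the first argument is weighted."""
    return 2 * x + y + z


def is_symmetric_on(func, args):
    """Return True if func gives the same value for every ordering of args.

    Checking a single set of inputs can disprove symmetry,
    but it cannot prove it for all inputs.
    """
    expected = func(*args)
    return all(func(*p) == expected for p in permutations(args))


print(is_symmetric_on(f, (1, 2, 3)))  # True
print(is_symmetric_on(g, (1, 2, 3)))  # False
```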

2. Quantiles

In statistics, quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities, or dividing the observations in a sample in the same way. Common quantiles have special names: for instance, quartiles (four groups), deciles (ten groups), and percentiles (100 groups).

Wikipedia link

Pandas quantile link
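To make the "equal groups" idea concrete, here is a small sketch using np.quantile on made-up data (the integers 1 to 100). The cut points use NumPy's default linear interpolation; other interpolation methods would give slightly different values.

```python
import numpy as np

data = np.arange(1, 101)  # the integers 1..100, made up for illustration

# The three quartile cut points split the sample into four groups.
quartiles = np.quantile(data, [0.25, 0.5, 0.75])
print(quartiles)  # [25.75 50.5  75.25]

# Assign each observation to the group it falls into and count group sizes.
group = np.digitize(data, quartiles)
counts = np.bincount(group)
print(counts.tolist())  # [25, 25, 25, 25]
```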

3. Order Statistics

In statistics, the k-th order statistic of a statistical sample is equal to its k-th smallest value. Together with rank statistics, order statistics are among the most fundamental tools in non-parametric statistics and inference. Important special cases are the minimum and maximum value of a sample.
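A minimal sketch of the definition, assuming a 1-based k as in the text. The helper name order_statistic is hypothetical; it uses np.partition, which places the k-th smallest element in its sorted position without sorting the whole array.

```python
import numpy as np


def order_statistic(sample, k):
    """Return the k-th smallest value of sample (k is 1-based)."""
    values = np.asarray(sample)
    if not 1 <= k <= values.size:
        raise ValueError("k must be between 1 and the sample size")
    # After partitioning, the element at index k-1 is the one that
    # would be there if the array were fully sorted.
    return np.partition(values, k - 1)[k - 1]


sample = [6, 9, 3, 8]
print(order_statistic(sample, 1))  # 3, the minimum
print(order_statistic(sample, 2))  # 6
print(order_statistic(sample, 4))  # 9, the maximum
```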

4. Create a list of functions to apply to each row, and then use a loop to create new columns for each function

This code defines a list of tuples, where each tuple contains a function and a column name.

The create_order_stats function applies each function row-wise (axis=1) to the init_feat columns of the input DataFrame df and stores the result in a new column. Percentiles are computed with np.percentile, with the desired q supplied as a keyword argument. Finally, the code loops over the train and test DataFrames, applying the create_order_stats function to each one.

I used the starred *args syntax in the loop's tuple unpacking so that each entry can carry optional extra arguments. If a function needs additional arguments (like np.percentile), they are given as a dictionary in the third position of the tuple, which ends up in the args list. If a function needs no additional arguments (like np.min), the args list is empty.

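import numpy as np
from scipy.stats import kurtosis, skew

def create_order_stats(df, init_feat):
    # Each entry is (function, new column name[, extra keyword arguments]).
    # np.percentile expects q on a 0-100 scale, so the 1st percentile is q=1.
    funcs = [
        (np.min, 'Fmin'),
        (np.percentile, '1P', {'q': 1}),
        (np.percentile, '5P', {'q': 5}),
        (np.percentile, '10P', {'q': 10}),
        (np.percentile, '25P', {'q': 25}),
        (np.median, 'Fmedian'),
        (np.percentile, '75P', {'q': 75}),
        (np.percentile, '90P', {'q': 90}),
        (np.percentile, '95P', {'q': 95}),
        (np.percentile, '99P', {'q': 99}),
        (np.max, 'Fmax'),
        (np.std, 'Fstd'),
        (np.var, 'Fvar'),
        (skew, 'Fskew'),
        (kurtosis, 'Fkurt')
    ]

    # Values of the selected feature columns as a 2-D array (one row per sample).
    values = df[init_feat].values

    for func, col_name, *args in funcs:
        # args is [] for two-element tuples, or [kwargs_dict] otherwise.
        kwargs = args[0] if args else {}
        df[col_name] = func(values, axis=1, **kwargs)

# train, test and init_feat are assumed to be defined earlier in the notebook.
for df in [train, test]:
    create_order_stats(df, init_feat)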
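One detail worth checking when passing q this way: np.percentile takes q on a 0-100 scale, while np.quantile takes it on a 0-1 scale. The sketch below reuses the same starred-unpacking pattern on a single made-up row; the names specs and results are illustrative only.

```python
import numpy as np

values = np.array([[0.0, 25.0, 50.0, 75.0, 100.0]])  # one made-up row

specs = [
    (np.min, "Fmin"),                    # no extra arguments
    (np.percentile, "25P", {"q": 25}),   # q on the 0-100 scale
    (np.quantile, "25Q", {"q": 0.25}),   # q on the 0-1 scale
]

results = {}
for func, name, *extra in specs:
    # extra is [] for two-element tuples and [kwargs_dict] otherwise.
    kwargs = extra[0] if extra else {}
    results[name] = func(values, axis=1, **kwargs)

print(results["Fmin"])  # [0.]
print(results["25P"])   # [25.]
print(results["25Q"])   # [25.]
```

Passing q=0.25 to np.percentile would instead ask for the 0.25th percentile, which sits very close to the row minimum.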

5. Color Palette in Seaborn

You can find color palettes in seaborn here

6. Implications of negative correlation between features:

A negative correlation between two variables can indicate that one attribute is a substitute for the other. This means that as one variable increases, the other tends to decrease.
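A toy illustration of the substitute case, using made-up numbers: two hypothetical ingredients that always add up to a fixed total, so more of one necessarily means less of the other. Real features are rarely this clean, so expect correlations between -1 and 0 in practice.

```python
import numpy as np

# Made-up data: the two amounts always sum to 100.
ingredient_a = np.array([10.0, 25.0, 40.0, 55.0, 70.0])
ingredient_b = 100.0 - ingredient_a

# Pearson correlation coefficient between the two columns.
r = np.corrcoef(ingredient_a, ingredient_b)[0, 1]
print(round(r, 3))  # -1.0
```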

7. Standardize before creating polynomial features?

When applying polynomial features, it’s generally recommended to standardize your data after creating the polynomial features. If you standardize only beforehand and do not standardize the new features after creating them, then although the original features are standardized, the new features can be one or more orders of magnitude smaller than the original ones.

Do I have to standardize my new polynomial features?

Should I standardize first or generate polynomials first?

@AMBROSM standardize before…notebook
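The effect can be seen with NumPy alone. This is a minimal sketch on a made-up uniform feature, with a hypothetical standardize helper; it only shows that a squared term built from a standardized feature is itself no longer zero-mean and unit-variance, whereas standardizing after building the term restores that.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=1000)  # made-up raw feature


def standardize(a):
    """Rescale to zero mean and unit (population) standard deviation."""
    return (a - a.mean()) / a.std()


# Option A: standardize first, then square, and stop there.
z = standardize(x)
z_squared = z ** 2
print(z_squared.mean())  # close to 1, not 0

# Option B: build the polynomial term first, then standardize it.
x_squared_std = standardize(x ** 2)
print(x_squared_std.mean(), x_squared_std.std())  # close to 0 and 1
```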

8. Adding Plus or Minus

On Windows, hold down the Alt key and type 241 on the numeric keypad to insert ±.

9. Get not equal to of DataFrame and other, element-wise (binary operator ne) -> Pandas ne

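Instead of using the != operator, you can use the ne method to compare a DataFrame or Series with another value element-wise; it returns True wherever the elements differ. Here the boolean result is cast to int to build a 0/1 indicator column: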

df['hasBlastFurnaceSlag'] = df.BlastFurnaceSlag.ne(0).astype(int)
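A small check on made-up values that the ne method, the != operator, and NumPy's np.not_equal all produce the same indicator. The column values here are invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"BlastFurnaceSlag": [0.0, 142.5, 0.0, 20.0]})  # made-up values

via_method = df["BlastFurnaceSlag"].ne(0).astype(int)
via_operator = (df["BlastFurnaceSlag"] != 0).astype(int)
via_numpy = np.not_equal(df["BlastFurnaceSlag"].to_numpy(), 0).astype(int)

print(via_method.tolist())              # [0, 1, 0, 1]
print(via_method.equals(via_operator))  # True
print(via_numpy.tolist())               # [0, 1, 0, 1]
```

The method form is mostly a readability choice: it chains cleanly with astype and avoids the extra parentheses the operator form needs.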