Daily Note - 09-05-2024

scikit-learn
probability theory
partial dependence plots
Author

JM Ascacibar

Published

May 9, 2024

ColumnTransformer and make_pipeline, Spline transformer, os.system(), Independence concept, Feature independence, Partial Dependence Plot, Individual conditional expectations (ICE)


1. Using column transformer with a pipeline

To create a pipeline with a ColumnTransformer that applies a StandardScaler and PolynomialFeatures to a subset of features and then fits a Ridge regressor, use the make_pipeline function with the ColumnTransformer as the first step. The trick comes next: inside the ColumnTransformer, you want to standardize first and then apply the polynomial features to a specific set of features. To do that, create a nested make_pipeline inside the ColumnTransformer. The code below shows how:

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# init_feat is the list of column names to scale and expand
model = make_pipeline(
    ColumnTransformer([
        ('scaler_poly', make_pipeline(StandardScaler(), PolynomialFeatures(2)), init_feat)
    ], remainder='passthrough'),  # other columns pass through unchanged
    Ridge(alpha=73)
)
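For reference, here is a self-contained version of the same pattern fitted on synthetic data; the column names, the `init_feat` subset, and the target are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.RandomState(0)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
y = X["a"] ** 2 + X["b"] + rng.normal(scale=0.1, size=100)

init_feat = ["a", "b"]  # subset that gets scaled and expanded
model = make_pipeline(
    ColumnTransformer(
        [("scaler_poly", make_pipeline(StandardScaler(), PolynomialFeatures(2)), init_feat)],
        remainder="passthrough",  # column "c" passes through untouched
    ),
    Ridge(alpha=73),
)
model.fit(X, y)
print(round(model.score(X, y), 2))
```

Because the degree-2 expansion includes the a² term that generates the target, the fit captures most of the signal even with the strong Ridge penalty.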

2. Spline transformer

The SplineTransformer lets you add nonlinear features without using pure polynomials; splines can sometimes fit the data better than polynomials.

3. Executing a system command inside of a notebook

To execute a system command inside a notebook, you can prefix the command with an exclamation mark. If part of the command is stored in a Python variable, one option is to build the command string with an f-string and pass it to the os.system() function from the os module. The code below shows how to do that:

import os

filename = '(unknown)'  # placeholder path; substitute the file to inspect
os.system(f'head {filename}')

4. Independence (probability theory)

We say that two events are independent if the occurrence of one event does not affect the probability of the other event.

Similarly, two random variables are independent (\(A \perp B\)) if the realization of one doesn’t affect the probability distribution of the other.
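Written formally (standard definitions, using the same letters as above):

```latex
% Two events A and B are independent iff
P(A \cap B) = P(A)\,P(B)

% Two random variables A and B are independent (A \perp B) iff their joint
% distribution factorizes, i.e. for all a and b:
P(A \le a,\; B \le b) = P(A \le a)\,P(B \le b)
```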

5. Feature independence

In machine learning and statistics, feature independence refers to the assumption that each feature in a model is independent of the others. In practice, this assumption is rarely true: most features are correlated with each other to some extent. You can use feature engineering to create new features that are less correlated with each other, use regularization to reduce the impact of correlated features on your model, or choose a model that handles correlated features well, such as an ensemble tree-based model like a random forest.
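As an illustrative sketch (synthetic data, not from the note), a correlation matrix is a quick first check of how far features are from independent:

```python
import numpy as np

rng = np.random.RandomState(0)
a = rng.normal(size=500)
b = 0.9 * a + 0.1 * rng.normal(size=500)  # strongly correlated with a
c = rng.normal(size=500)                  # roughly independent of a and b

# Pairwise Pearson correlations between the three features
corr = np.corrcoef(np.vstack([a, b, c]))
print(np.round(corr, 2))
```

Zero correlation does not imply independence in general, but a near-1 entry like corr(a, b) here is a clear sign the features carry redundant information.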

6. Partial dependence plot

Partial dependence plots help you understand how each feature affects the model’s predictions: for example, how the water or cement component affects the predicted strength of concrete.

Suppose we want to understand how the cement and water features affect the predicted strength. We can create partial dependence plots to visualize these relationships.

Cement (kg/m³)   Predicted Strength (MPa)
200              20
220              22
240              24
260              26
280              28
300              30
320              32
340              34
360              36
380              38
400              40

In this plot, we vary the cement feature from 200 kg/m³ to 400 kg/m³. For each cement value on the grid, we compute the model’s predictions with the other features (here, water) left at their observed values, and plot the average of those predictions. (Fixing water at a single value, such as its median of say 150 kg/m³, would give one conditional slice rather than the average that a partial dependence plot shows.)

The plot shows that as the amount of cement increases, the predicted strength of the concrete also increases. This makes sense, as cement is a key component of concrete that provides strength.
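In scikit-learn, this averaging is done by sklearn.inspection.partial_dependence (PartialDependenceDisplay draws the plot itself). A minimal sketch on synthetic data, where feature 0 plays the role of cement:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import partial_dependence

rng = np.random.RandomState(0)
X = rng.uniform(size=(200, 2))  # columns stand in for "cement" and "water"
y = 50 * X[:, 0] - 20 * X[:, 1] + rng.normal(scale=1.0, size=200)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Average prediction over the data as feature 0 sweeps a grid of values
pdp = partial_dependence(model, X, features=[0], kind="average")
curve = pdp["average"][0]
print(curve.shape)
```

Since the synthetic target increases with feature 0, the averaged curve trends upward, just like the cement example above.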

7. Individual conditional expectation (ICE) plot:

Unlike a PDP, which shows the average effect of the input feature, an ICE plot visualizes the dependence of the prediction on a feature for each sample separately with one line per sample.
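A sketch of the per-sample curves behind an ICE plot, using sklearn.inspection.partial_dependence with kind="individual" on synthetic data (the data and model here are made up):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import partial_dependence

rng = np.random.RandomState(0)
X = rng.uniform(size=(100, 2))
y = 10 * X[:, 0] + rng.normal(scale=0.5, size=100)
model = RandomForestRegressor(n_estimators=30, random_state=0).fit(X, y)

# One curve per sample: shape (n_outputs, n_samples, n_grid_points)
ice = partial_dependence(model, X, features=[0], kind="individual")
print(ice["individual"].shape)
```

Averaging these individual curves over the samples recovers the PDP; plotting them separately (e.g. with PartialDependenceDisplay.from_estimator(..., kind="individual")) reveals whether the effect is uniform across samples or hides interactions.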

Scikit-Learn documentation

Resources

  • I’ve found a very good course about ML by Professor Larry Wasserman. Link
  • ritvikmath is a great YouTube channel about statistics and ML