Sklearn Tutorial Python

Scikit Learn (or Sklearn) is one of the most used Python libraries in the world of Machine Learning. Without a doubt, it is a fantastic library since it offers a very simple way to create Machine Learning models of all kinds. But do you know how it works and the tricks it has? In this tutorial, I am going to explain everything you need to start creating Machine Learning models in Python with Scikit Learn. Sounds good to you? Well, let’s get to it!

Introduction to Scikit Learn

Scikit Learn is a Machine Learning library in Python that seeks to help us in the main aspects when facing a Machine Learning problem. More specifically, Scikit Learn has functions to help us:

  • Data preprocessing, including:
    • Split between train and test.
    • Imputation of missing values.
    • Data transformation.
    • Feature engineering.
    • Feature selection.
  • Creation of models, including:
    • Supervised models
    • Unsupervised models
  • Optimization of hyperparameters of the models

As you can see, Scikit Learn is a very complete and very useful library (which is why it is so well known and used). To see all these functionalities we are going to need some data. For this, I will use the datasets module of Scikit-learn to be able to use some of the datasets that the library brings by default:

from sklearn import datasets
import pandas as pd
import numpy as np

wine = datasets.load_wine()

data = pd.DataFrame(data= wine['data'],
                    columns= wine['feature_names'])

y = wine['target']
print(y[:10])
data.head()
[0 0 0 0 0 0 0 0 0 0]
   alcohol  malic_acid   ash  alcalinity_of_ash  magnesium  total_phenols  flavanoids  nonflavanoid_phenols  proanthocyanins  color_intensity   hue  od280/od315_of_diluted_wines  proline
0    14.23        1.71  2.43               15.6      127.0           2.80        3.06                  0.28             2.29             5.64  1.04                          3.92   1065.0
1    13.20        1.78  2.14               11.2      100.0           2.65        2.76                  0.26             1.28             4.38  1.05                          3.40   1050.0
2    13.16        2.36  2.67               18.6      101.0           2.80        3.24                  0.30             2.81             5.68  1.03                          3.17   1185.0
3    14.37        1.95  2.50               16.8      113.0           3.85        3.49                  0.24             2.18             7.80  0.86                          3.45   1480.0
4    13.24        2.59  2.87               21.0      118.0           2.80        2.69                  0.39             1.82             4.32  1.04                          2.93    735.0

Let’s start with our Scikit Learn tutorial by looking at the logic behind Scikit Learn. Let’s get to it!

Logic behind Sklearn

A very interesting and useful thing about Sklearn is that, both in data preparation and in model creation, it makes a distinction between training (fit) and transforming or predicting (transform/predict).

This is very convenient because it allows us to save the fitted object so that, when we need to transform or predict, we simply load it and apply the transformation/prediction.

So, when we work with Sklearn, we will have to get used to first fitting and then applying it to our data.
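
As a minimal sketch of this fit-then-apply pattern (the scaler and the file name here are just illustrative examples; we will cover scaling in detail later), the flow looks like this:

# Illustrative sketch of the general Sklearn pattern: fit first, then apply
from sklearn.preprocessing import StandardScaler
from joblib import dump, load

scaler = StandardScaler()
scaler.fit(data)                          # 1. learn the parameters from the data (train)
data_scaled = scaler.transform(data)      # 2. apply them to the data

dump(scaler, 'scaler.joblib')             # 3. save the fitted object...
scaler = load('scaler.joblib')            # ...and reload it later to transform new data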

Knowing this, let’s see how Sklearn works!

Data preprocessing with Sklearn

Split between train and test

As you may already know, before applying any transformation to our dataset, we must first split our data into train and test. The idea is that the test data is not involved in any transformation, as if it were genuinely new data.

Thus, to perform the split between train and test we have the train_test_split function, which returns a tuple with four elements: X_train, X_test, y_train, and y_test.

Likewise, for the split to be reproducible we can set the seed using the random_state parameter.

Let’s see how it works:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data, y, 
                                                    test_size = 0.8,
                                                    random_state = 1234)

print(f'X train shape {X_train.shape} \nX test shape {X_test.shape}')
X train shape (35, 13) 
X test shape (143, 13)

As you can see, doing the split with sklearn is super simple. Now, let’s move on to our sklearn tutorial, looking at how to impute missing values.

Imputation of Missing Values with Sklearn

First of all, we are going to check if our dataset contains missing values so that we can impute them:

X_train.isna().sum()
alcohol                         0
malic_acid                      0
ash                             0
alcalinity_of_ash               0
magnesium                       0
total_phenols                   0
flavanoids                      0
nonflavanoid_phenols            0
proanthocyanins                 0
color_intensity                 0
hue                             0
od280/od315_of_diluted_wines    0
proline                         0
dtype: int64

As we can see, the dataset does not contain any missing values, but that is not a problem: we are going to use a copy of this dataset and introduce NAs to demonstrate how missing value imputation works in Sklearn:

data_na = X_train.copy()

for col in data_na.columns:
    data_na.loc[data_na.sample(frac=0.1).index, col] = np.nan

data_na.isna().sum()
alcohol                         4
malic_acid                      4
ash                             4
alcalinity_of_ash               4
magnesium                       4
total_phenols                   4
flavanoids                      4
nonflavanoid_phenols            4
proanthocyanins                 4
color_intensity                 4
hue                             4
od280/od315_of_diluted_wines    4
proline                         4
dtype: int64

When we have missing values, there are several approaches we can take:

  • Eliminate the observations with missing values (see the quick pandas sketch below).
  • Impute a constant value obtained from the variable itself (the mean, mode, median, etc.). This type of imputation is known as univariate imputation.
  • Use all the available variables to perform the imputation, that is, multivariate imputation. A typical multivariate imputation model is a kNN model.
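
The first option does not even need Sklearn; plain pandas is enough. A minimal sketch:

# Drop the rows that contain any missing value (plain pandas, no Sklearn needed)
data_no_na = data_na.dropna()
print(f'{data_na.shape[0] - data_no_na.shape[0]} observations dropped')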

We have all these options within the Sklearn impute module. Let’s start with the simple imputation.

Univariate imputation of missing values

Within univariate imputation, there are several values that we can impute; more specifically, you can impute the mean, the median, the mode, or a fixed value.

I personally do not like to impute the mean, since it can be greatly affected by the distribution of the data. Instead, I usually prefer other values such as the mode or the median.

Let’s see how we can do univariate imputation in Sklearn:

# Univariate Imputation
from sklearn.impute import SimpleImputer

mode_imputer = SimpleImputer(strategy = 'most_frequent')

# For each column, make imputation
for column in data_na.columns:
    values = data_na[column].values.reshape(-1,1)
    mode_imputer.fit(values)
    data_na[column] = mode_imputer.transform(values)

# Check Nas
data_na.isna().sum()
alcohol                         0
malic_acid                      0
ash                             0
alcalinity_of_ash               0
magnesium                       0
total_phenols                   0
flavanoids                      0
nonflavanoid_phenols            0
proanthocyanins                 0
color_intensity                 0
hue                             0
od280/od315_of_diluted_wines    0
proline                         0
dtype: int64

As you can see, in a very simple way we have been able to impute absolutely all the missing values that we had in the dataset.

In addition, to make the imputation with another value such as the mean or the median, the strategy would simply have to be changed to mean or median, respectively.
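
For example, a minimal sketch of these alternative strategies (the fill_value shown for the constant strategy is just an illustrative choice):

# Other univariate strategies offered by SimpleImputer
mean_imputer = SimpleImputer(strategy='mean')
median_imputer = SimpleImputer(strategy='median')
constant_imputer = SimpleImputer(strategy='constant', fill_value=0)  # 0 is an arbitrary example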

As you can see, imputing missing values using the data of the variable itself is very easy with Sklearn. However, Sklearn goes much further and offers other options, such as imputation that takes several variables into account. Let’s see how it works.

Multivariate imputation of missing values

The idea behind a multivariate imputation is to create a regressor and try to predict each of the variables with the rest of the variables that we have. In this way, the regressor can learn the relationship between the data and can perform an imputation using all the variables in the dataset.

This is a feature that is still experimental in Sklearn. That is why, for it to work, we will first have to enable it by importing enable_iterative_imputer.

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer 

data_na = X_train.copy()

# Generate new nas
for col in data_na.columns:
    data_na.loc[data_na.sample(frac=0.1).index, col] = np.nan

# Create imputer
iter_imputer = IterativeImputer(max_iter=15, random_state=1234)

# Transform data
iter_imputer_fit = iter_imputer.fit(data_na.values) 
imputed_data = iter_imputer_fit.transform(data_na)

pd.DataFrame(imputed_data, columns = data_na.columns)\
    .isna()\
    .sum()
alcohol                         0
malic_acid                      0
ash                             0
alcalinity_of_ash               0
magnesium                       0
total_phenols                   0
flavanoids                      0
nonflavanoid_phenols            0
proanthocyanins                 0
color_intensity                 0
hue                             0
od280/od315_of_diluted_wines    0
proline                         0
dtype: int64

As you can see, we have created an imputation system that takes into account all the variables in order to carry out the imputation of the missing values.

Likewise, within multivariate imputers, a very typical way to carry out imputation is using the kNN model. This is something that Sklearn also offers. Let’s see it.

Imputation by kNN

For the imputation of missing values using the kNN algorithm, Sklearn looks for the observations that are most similar to each observation with missing values and uses the values of those neighbors to do the imputation.

As in the normal kNN algorithm, the main parameter that we have to choose is the number of neighbors to take into account to make the imputation.

As I explained in the post about how to code kNN from scratch in R, there are two ways to choose the number of neighbors:

  • The square root of the number of observations, which ensures that you choose a value that is neither too small nor too large.
  • The elbow method, which consists of calculating the error for different values of k and choosing the one that minimizes it (see the sketch below).
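
As a hedged sketch of the second option (not what is used later in this tutorial), one could hide some known values, impute them with KNNImputer for several candidate k, and keep the k with the lowest error:

# Illustrative elbow-style search for k: mask known values, impute, measure the error
from sklearn.impute import KNNImputer
from sklearn.metrics import mean_squared_error

complete_rows = X_train.dropna()                 # rows without real missing values
rng = np.random.default_rng(1234)

# Hide roughly 10% of the cells so we know their true values
mask = rng.random(complete_rows.shape) < 0.1
masked = complete_rows.mask(mask)

errors = {}
for k in range(1, 11):
    imputed = KNNImputer(n_neighbors=k).fit_transform(masked)
    errors[k] = mean_squared_error(complete_rows.values[mask], imputed[mask])

# For a real use case, scaling the features first would make the error more meaningful
best_k = min(errors, key=errors.get)
print(f'k with the lowest imputation error: {best_k}')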

In this case, since this is just a demonstration, I will use the square-root approach.

So, to impute missing values with Sklearn using kNN we will have to use the KNNImputer class.

from sklearn.impute import KNNImputer

data_na = X_train.copy()

# Generate new nas
for col in data_na.columns:
    data_na.loc[data_na.sample(frac=0.1).index, col] = np.nan

# Select k
k = int(np.round(np.sqrt(data_na.shape[0])))

# Create imputer
knn_imputer = KNNImputer(n_neighbors=k)

# Transform data
knn_imputer_fit = knn_imputer.fit(data_na.values)

imputed_data = knn_imputer_fit.transform(data_na)

pd.DataFrame(imputed_data, columns = data_na.columns)\
    .isna()\
    .sum()
alcohol                         0
malic_acid                      0
ash                             0
alcalinity_of_ash               0
magnesium                       0
total_phenols                   0
flavanoids                      0
nonflavanoid_phenols            0
proanthocyanins                 0
color_intensity                 0
hue                             0
od280/od315_of_diluted_wines    0
proline                         0
dtype: int64

With this, we have already covered the imputation of missing values. Now let’s see how to transform the data with Sklearn.

Data transformation

There are many transformations that we can (and sometimes must) apply to our data, such as normalization, transformations to make a variable follow a certain distribution, One-Hot encoding, etc.

For this, Sklearn offers the preprocessing module, thanks to which we can perform all the transformations discussed above and more. So, let’s first take a look at our data and, from there, start transforming it:

X_train.describe()
         alcohol  malic_acid        ash  alcalinity_of_ash   magnesium  total_phenols  flavanoids  nonflavanoid_phenols  proanthocyanins  color_intensity        hue  od280/od315_of_diluted_wines      proline
count  35.000000   35.000000  35.000000          35.000000   35.000000      35.000000   35.000000             35.000000        35.000000        35.000000  35.000000                     35.000000    35.000000
mean   12.904857    2.160000   2.371143          19.174286   98.428571       2.324000    2.072286              0.364571         1.660000         4.777143   0.996857                      2.635714   757.714286
std     0.879199    1.026805   0.311974           3.675142   13.560942       0.614411    0.982162              0.135914         0.688699         2.190065   0.250103                      0.645907   361.713778
min    11.030000    0.900000   1.710000          12.000000   80.000000       0.980000    0.340000              0.140000         0.410000         1.900000   0.570000                      1.330000   278.000000
25%    12.270000    1.555000   2.110000          16.050000   90.000000       1.900000    1.280000              0.270000         1.260000         3.285000   0.870000                      2.145000   447.000000
50%    13.030000    1.800000   2.390000          19.000000   98.000000       2.350000    2.240000              0.320000         1.540000         4.400000   0.960000                      2.710000   640.000000
75%    13.630000    2.540000   2.635000          21.000000  102.500000       2.800000    2.865000              0.490000         2.075000         5.790000   1.185000                      3.145000  1057.500000
max    14.380000    5.040000   2.920000          28.500000  151.000000       3.850000    3.640000              0.630000         2.960000        13.000000   1.710000                      3.640000  1547.000000
%matplotlib inline
X_train.hist()
Distribution of the variables.

As we can see, there are several interesting issues that we will surely have to address, such as:

  • Variables with most of their mass piled up on the left (i.e. right-skewed).
  • Variables with most of their mass piled up on the right (i.e. left-skewed).
  • Categorical variables encoded as numeric variables.
  • Variables with possible outliers (see the brief note below).
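
Regarding the last point, a hedged aside (it is not used later in this tutorial): sklearn.preprocessing also includes RobustScaler, which scales using the median and the interquartile range and is therefore less sensitive to outliers than the standard scaler:

# Illustrative only: scaling that is robust to outliers
from sklearn.preprocessing import RobustScaler

robust_scaler = RobustScaler()
X_train_robust = robust_scaler.fit_transform(X_train)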

Let’s go through each of these issues little by little, starting with how to modify the distribution of the data.

Modify the distribution of a variable

Let’s take the example of the variable malic_acid, whose distribution is clearly piled up on the left. For these cases, Sklearn’s preprocessing module offers the QuantileTransformer and PowerTransformer classes, with which we can correct the skewness of our data. Let’s see how they work.

from sklearn import preprocessing
import matplotlib.pyplot as plt

quantile_transf_norm =  preprocessing.QuantileTransformer(output_distribution= 'normal')
quantile_transf_uniform =  preprocessing.QuantileTransformer(output_distribution= 'uniform')

data_to_transform = X_train['malic_acid'].values.reshape(-1,1)
transformed_data_normal =  quantile_transf_norm.fit_transform(data_to_transform)
transformed_data_uniform =  quantile_transf_uniform.fit_transform(data_to_transform)

# Create plots
fig, axs = plt.subplots(3)

axs[0].hist(X_train['malic_acid'].values)
axs[1].hist(transformed_data_normal)
axs[2].hist(transformed_data_uniform)
Transformations of the distributions with Sklearn.

As you can see, thanks to Sklearn’s QuantileTransformer we have gone from a skewed variable to one that follows an approximately normal distribution or a uniform distribution.

Clearly, the transformation to apply will depend on the specific case, but as you can see, once we know that we have to transform the data, doing it with Sklearn is very simple.
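
The PowerTransformer mentioned above works in the same fit/transform way; a minimal sketch using its default Yeo-Johnson method could look like this:

# Illustrative sketch: PowerTransformer with the (default) Yeo-Johnson method
power_transf = preprocessing.PowerTransformer(method='yeo-johnson')
transformed_data_power = power_transf.fit_transform(data_to_transform)

plt.hist(transformed_data_power)
plt.show()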

So, let’s continue with our Sklearn tutorial, seeing how to normalize or standardize variables.

How to normalize or standardize the data in Sklearn

Other typical transformations that we can apply are standardization and normalization, which we can perform with the StandardScaler and MinMaxScaler classes, respectively.

In my opinion, it is usually preferable to standardize than normalize, since normalization can cause problems in production (a value greater than 1 or less than 0).

In any case, let’s see how we can normalize and standardize in Python with Sklearn:

# Create the scaler
stand_scale = preprocessing.StandardScaler()
normal_scale = preprocessing.MinMaxScaler()

# Fit the scaler
stand_scale_fit = stand_scale.fit(X_train)
normal_scale_fit = normal_scale.fit(X_train)

# Apply the scaler to train
X_train_scale = stand_scale_fit.transform(X_train)
X_train_norm = normal_scale_fit.transform(X_train)

# Apply the scaler to test
X_test_scale = stand_scale_fit.transform(X_test)
X_test_norm = normal_scale_fit.transform(X_test)

# Convert data to DataFrame
X_train_scale = pd.DataFrame(X_train_scale, columns = data_na.columns)
X_train_norm = pd.DataFrame(X_train_norm, columns = data_na.columns)

# Check Standardization
standard_sd =  X_train_scale['malic_acid'].std()
standard_mean = X_train_scale['malic_acid'].mean()
normalized_min =  X_train_norm['malic_acid'].min()
normalized_max = X_train_norm['malic_acid'].max()

print(f'Standardized data has SD of {np.round(standard_sd)} and mean of {np.round(standard_mean)}')
print(f'Normalized data has Min of {normalized_min} and max of {normalized_max}')
Standardized data has SD of 1.0 and mean of -0.0
Normalized data has Min of 0.0 and max of 1.0

As you can see, normalizing or standardizing the data with Sklearn is very simple. However, these transformations only apply to numeric data.

Now, let’s see how to one-hot encode the data, which is the main transformation of categorical data.

How to do One-Hot Encoding with Sklearn

When working with categorical variables, one of the most important things is to transform them into numeric ones. To do this, we apply dummy encoding or one-hot encoding, which consists of creating one new binary (1/0) variable per level of the categorical variable (or one less than the number of levels, if we drop a reference category).

It is important to carry out the One-hot encoding process after applying the transformations to the numeric variables (normalization, standardization, etc.). Otherwise, we would end up transforming the dummy variables as well and they would no longer make sense.

Performing a One-hot encoding transformation with Sklearn is very easy thanks to the OneHotEncoder class. To see how it works, I’ll create an array with three possible values: UK, USA, or Australia.

label = np.array(['USA','Australia', 'UK', 'UK','USA','Australia','USA',
                  'UK', 'UK','UK','Australia'])

label
array(['USA', 'Australia', 'UK', 'UK', 'USA', 'Australia', 'USA', 'UK',
       'UK', 'UK', 'Australia'], dtype='<U9')

Encoding the variable is as simple as passing it to the OneHotEncoder. However, by default, this creates as many new variables as there are possible categories. This is usually not ideal, as n-1 variables would suffice. With the drop = 'first' parameter we can avoid this redundancy in the data.

In addition, something that typically happens when a model is making predictions in production is that a new level appears that was not seen during training. By default, this will raise an error, which may not be what we want (especially if the model scores data in batches). To avoid problems if this happens, we can set handle_unknown = 'ignore'.

Let’s see how to do it:

from sklearn.preprocessing import OneHotEncoder

# Create encoder
ohencoder = OneHotEncoder(drop='first')

# Fit the encoder
ohencoder_fit = ohencoder.fit(label.reshape(-1,1))

# Transform the data
ohencoder_fit.transform(label.reshape(-1,1)).toarray()
array([[0., 1.],
       [0., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [0., 0.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 0.]])
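
As a hedged sketch of the handle_unknown option mentioned above (shown here without drop, since older scikit-learn versions do not allow combining both options):

# Illustrative sketch: ignore categories that were not seen during fit
ohe_ignore = OneHotEncoder(handle_unknown='ignore')
ohe_ignore_fit = ohe_ignore.fit(label.reshape(-1, 1))

# 'Spain' was not present during fit, so its row is encoded as all zeros
new_labels = np.array(['USA', 'Spain']).reshape(-1, 1)
print(ohe_ignore_fit.transform(new_labels).toarray())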

As you can see, transforming our data with Sklearn is super simple. And that’s not all: where we can get the most out of Sklearn (and what it is best known for) is in the creation of Machine Learning models.

Let’s continue with our Sklearn tutorial by seeing how to create Machine Learning models.

How to create Machine Learning models with Sklearn

Within Sklearn we have many different families of Machine Learning models that we can apply and, within each family, there may be several different models.

Therefore, in the following table, I include all the Machine Learning model families along with the name of the module where they are found:

Supervised model                              Module
Linear Models                                 linear_model
Linear and Quadratic Discriminant Analysis    discriminant_analysis
Kernel ridge regression                       kernel_ridge
Support Vector Machines                       svm
Stochastic Gradient Descent                   linear_model
Nearest Neighbors                             neighbors
Gaussian Processes                            gaussian_process
Cross decomposition                           cross_decomposition
Naive Bayes                                   naive_bayes
Decision Trees                                tree
Ensemble methods                              ensemble
Multiclass and multioutput algorithms         multiclass & multioutput
Semi-supervised learning                      semi_supervised
Isotonic regression                           isotonic
Probability calibration                       calibration
Neural network models (supervised)            neural_network

Something similar happens with unsupervised models: there are many of them, spread across different modules:

Model                                                                 Module
Gaussian mixture models                                               mixture
Manifold learning                                                     manifold
Clustering                                                            cluster
Decomposing signals in components (matrix factorization problems)    decomposition
Covariance estimation                                                 covariance
Neural network models (unsupervised)                                  neural_network

Although in this post we will not be able to cover all the Machine Learning models, we will use some of the main ones. More specifically, you are going to learn how to:

  1. Create a Machine Learning model with Sklearn.
  2. Validate the performance of your model with Sklearn.
  3. Find the optimal values of the model’s hyperparameters.

Considering the above, let’s see how it works:

How to create a Machine Learning model with Sklearn

To create a machine learning model with Sklearn, we first have to know which model we want to create, since, as we have seen previously, each model may be in a different module.

In our case, we are going to create two classification models: a logistic regression and a Random Forest. As in the previous cases, first of all, we are going to create our models:

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier


## -- Logistic Regression -- ##

# Create the model
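# Note: recent scikit-learn versions expect penalty=None instead of the string 'none'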
log_reg = LogisticRegression(penalty = 'none')

# Train the model
log_reg_fit = log_reg.fit(X_train, y_train)

# Make prediction
y_pred_log_reg = log_reg_fit.predict(X_test)


## -- Random Forest -- ##

# Create the model
rf_class = RandomForestClassifier()

# Train the model
rf_class_fit = rf_class.fit(X_train, y_train)

# Make prediction
y_pred_rf_class = rf_class_fit.predict(X_test)

print(f'Logistic Regression predictions {y_pred_log_reg[:5]}')
print(f'Random Forest predictions {y_pred_rf_class[:5]}')
Logistic Regression predictions [1 1 1 0 0]
Random Forest predictions [1 1 1 1 2]

As we can see, we have created the models in a very simple way. Now we will have to evaluate how good our models are. Let’s see how it works.

How to measure the performance of a model in Sklearn

The way to evaluate the performance of a model is to analyze how good its predictions are. For this, Sklearn offers different functions within the sklearn.metrics module.

In our case, as these are classification models, we can use metrics such as accuracy, precision, recall, the confusion matrix or the area under the curve (AUC), for example. Let’s see.

from sklearn.metrics import confusion_matrix, precision_score, recall_score, accuracy_score

print('##-- Logistic Regression --## ')
print(f'Accuracy: {accuracy_score(y_test, y_pred_log_reg)}')
print(f'Precision: {precision_score(y_test, y_pred_log_reg, average="macro")}')
print(f'Recall: {recall_score(y_test, y_pred_log_reg, average="macro")}')
print(confusion_matrix(y_test, y_pred_log_reg))


print('\n##-- Random Forest --## ')
print(f'Accuracy: {accuracy_score(y_test, y_pred_rf_class)}')
print(f'Precision: {precision_score(y_test, y_pred_rf_class, average="macro")}')
print(f'Recall: {recall_score(y_test, y_pred_rf_class, average="macro")}')
print(confusion_matrix(y_test, y_pred_rf_class))
##-- Logistic Regression --## 
Accuracy: 0.7202797202797203
Precision: 0.7578378120830952
Recall: 0.7104691075514874
[[40  6  0]
 [12 42  3]
 [14  5 21]]

##-- Random Forest --## 
Accuracy: 0.965034965034965
Precision: 0.9636781090033123
Recall: 0.9682748538011695
[[46  0  0]
 [ 2 53  2]
 [ 0  1 39]]
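
Since the AUC was mentioned above but is not computed in the code, here is a hedged sketch of how it could be obtained for this multiclass problem (one-vs-rest, using the predicted probabilities):

# Illustrative sketch: multiclass AUC (one-vs-rest) from predicted probabilities
from sklearn.metrics import roc_auc_score

auc_rf = roc_auc_score(y_test, rf_class_fit.predict_proba(X_test), multi_class='ovr')
print(f'Random Forest AUC (OvR): {auc_rf}')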

As we can see, the Random Forest model has obtained a much better result than the logistic regression model. However, we have not touched any of the Random Forest hyperparameters, so it is possible that, by searching for the optimal hyperparameters, we will get an even better result.

So, let’s continue with the Sklearn tutorial by looking at how to find the hyperparameters of a model. Let’s go there!

How to tune a model with Sklearn

In order to find the optimal values of a model’s hyperparameters, Sklearn offers a very useful tool: GridSearchCV.

It allows you to define a dictionary of parameters, specifying for each parameter all the values to be tested. In addition, each combination is evaluated with Cross-Validation, which helps avoid overfitting problems.

So, by simply passing our model and the grid of parameters that we want it to check to GridSearchCV, Sklearn will perform the search over all possible combinations of hyperparameters and, by default, it will evaluate each one with 5-fold cross-validation.

From this result, we can obtain the score obtained by each combination through the cv_results_ attribute.

So, let’s try a bunch of possible values for our Random Forest and see how the model works:

from sklearn.model_selection import GridSearchCV

rf_class = RandomForestClassifier()
grid = {
    'max_depth':[6,8,10], 
    'min_samples_split':[2,3,4,5],
    'min_samples_leaf':[2,3,4,5],
    'max_features': [2,4,6,8,10]
    }

rf_class_grid = GridSearchCV(rf_class, grid, cv = 10)
rf_class_grid_fit =  rf_class_grid.fit(X_train, y_train)

pd.concat([pd.DataFrame(rf_class_grid_fit.cv_results_["params"]),
           pd.DataFrame(rf_class_grid_fit.cv_results_["mean_test_score"], 
                        columns=["Accuracy"])],axis=1)
     max_depth  max_features  min_samples_leaf  min_samples_split  Accuracy
0            6             2                 2                  2  1.000000
1            6             2                 2                  3  1.000000
2            6             2                 2                  4  1.000000
3            6             2                 2                  5  1.000000
4            6             2                 3                  2  0.966667
...
235         10            10                 4                  5  0.941667
236         10            10                 5                  2  0.941667
237         10            10                 5                  3  0.941667
238         10            10                 5                  4  0.941667
239         10            10                 5                  5  0.941667

As you can see, we obtain the accuracy of the model for different values of the hyperparameters and, as expected, different combinations of hyperparameters generate different results. Models with a very high max_depth, for example, seem to perform worse than those with a lower max_depth.

Now that we have done Grid Search we can find which is the best model of all the ones we have tried:

print(f'Best parameters: {rf_class_grid_fit.best_params_}')
print(f'Best score: {rf_class_grid_fit.best_score_}')
Best parameters: {'max_depth': 6, 'max_features': 2, 'min_samples_leaf': 2, 'min_samples_split': 2}
Best score: 1.0

Now that we have our model trained, we can make the prediction on the test set to see what result we get.

y_pred_rf_grid = rf_class_grid_fit.predict(X_test)

print('##-- Random Forest Grid Search & CV --## ')
print(f'Accuracy: {accuracy_score(y_test, y_pred_rf_grid)}')
print(f'Precision: {precision_score(y_test, y_pred_rf_grid, average="macro")}')
print(f'Recall: {recall_score(y_test, y_pred_rf_grid, average="macro")}')
print(confusion_matrix(y_test, y_pred_rf_grid))
##-- Random Forest Grid Search & CV --## 
Accuracy: 0.972027972027972
Precision: 0.9714617554338809
Recall: 0.9766081871345028
[[46  0  0]
 [ 3 53  1]
 [ 0  0 40]]

As you can see, with the hyperparameter tuning we have managed to take our model from an accuracy of 96.5% to 97.2%, and all this in a super simple way.

But, believe it or not, there is still more. For me, one of the coolest things about Sklearn is that it allows you to put the entire Machine Learning process together in a single step. Let’s continue with our Sklearn tutorial and see how pipelines work. Let’s get to it!

Creating a Machine Learning pipeline

If you notice, the process that we have followed so far has been fairly sequential: first you prepare the variables (impute them, standardize them, dummy-encode them), then you train the model and finally you make the prediction.

Until now, each of these steps has been done separately: first you declare the transformer or the model, then you do the fit (or train, in the case of the model) and finally you apply it.

Luckily, Sklearn offers a much easier way to handle this whole process: Pipelines and Column Transformers. Thanks to them, instead of having to do each step separately, we can define which steps we want to be applied and Sklearn itself will apply them sequentially.

The main difference between the two is that a Pipeline allows several operations to be applied to the same columns, one after another, but cannot be parallelized, whereas a ColumnTransformer applies a single transformer to each group of columns, but can run them in parallel.

In our case, as we have performed several operations on the same column, we will use the pipeline. Let’s see how to use it:

from sklearn.pipeline import Pipeline 
from sklearn.feature_selection import VarianceThreshold

pipe = Pipeline([
    ('scaler', preprocessing.StandardScaler()),
    ('selector', VarianceThreshold()),
    ('classifier', RandomForestClassifier(max_depth=6, 
                                          max_features= 2,
                                          min_samples_leaf = 2,
                                          min_samples_split = 3
                                          ))
    ])


pipe_fit = pipe.fit(X_train, y_train)
y_pred_pipe = pipe_fit.predict(X_test)

print('##-- Random Forest with Pipe --## ')
print(f'Accuracy: {accuracy_score(y_test, y_pred_pipe)}')
print(f'Precision: {precision_score(y_test, y_pred_pipe, average="macro")}')
print(f'Recall: {recall_score(y_test, y_pred_pipe, average="macro")}')
print(confusion_matrix(y_test, y_pred_pipe))
##-- Random Forest with Pipe --## 
Accuracy: 0.972027972027972
Precision: 0.9702380952380952
Recall: 0.9766081871345028
[[46  0  0]
 [ 2 53  2]
 [ 0  0 40]]

As you can see, we have created a model, including all its transformations, in a few lines of code. In addition, Sklearn lets you save the model and the pipeline, so putting the model into production is quite simple. Let’s see how to save a pipeline:

from joblib import dump, load
from datetime import datetime

dump(pipe_fit, 'models/pipeline.joblib')

# Remove element and reload it
del(pipe_fit)

try:
    pipe_fit
except NameError:
    print(f'{datetime.now()}: Pipe does not exist.')

# Reload pipe
pipe_fit = load('models/pipeline.joblib')

try:
    pipe_fit
    print(f'{datetime.now()}: Pipe is loaded.')
except NameError:
    print(f'{datetime.now()}: Pipe not defined.')
2021-10-10 13:27:21.377056: Pipe does not exist.
2021-10-10 13:27:26.120434: Pipe is loaded.

Now, we can use that pipeline to make predictions on new data:

pipe_fit.predict(X_test)[:10]
array([1, 1, 1, 1, 2, 1, 2, 0, 0, 2])

Likewise, suppose there was a categorical column, on which we would like to apply One Hot Encoding, as we have seen previously. In this case, we could create:

  1. A pipeline for numeric variables.
  2. Another pipeline for categorical variables.
  3. A ColumnTransformer that applies the numeric pipeline to the numeric variables and the categorical pipeline to the categorical variables.

In this way, we will be parallelizing the transformations of numeric and categorical variables, even though the transformations for the same type of variable are sequential.

To see an example, I am going to create a new categorical variable and see how this process would be done:

from sklearn.compose import ColumnTransformer

X_train2 = X_train.copy()

# Create new categorical column
options = dict({0:'Moscatel',1:'Sultanina',2:'Merlot'})
X_train2['grape_type'] = np.random.randint(0,3, size = X_train.shape[0])
X_train2['grape_type'] = [options.get(grape) for grape in X_train2['grape_type']]

# Create pipeline for numerical variables
numeric_pipe = Pipeline([
    ('scaler', preprocessing.StandardScaler()),
    ('selector', VarianceThreshold())
    ])

# Create pipeline for categorical variable
categorical_pipe = Pipeline([
    ('encoder', preprocessing.OneHotEncoder(drop = 'first'))
    ])

# Create ColumnTransform
col_transf = ColumnTransformer([
    ('numeric', numeric_pipe, X_train2.select_dtypes(include=np.number).columns.tolist()),
    ('categoric', categorical_pipe, ['grape_type'])
    ])

col_transf_fit = col_transf.fit(X_train2)
X_train2_transf = col_transf_fit.transform(X_train2)


print('##-- Row 1 Before Transformation --## ')
print(X_train2.iloc[0,:].tolist())
print('##-- Row 1 After Transformation --## ')
print(X_train2_transf[0].tolist())
##-- Row 1 Before Transformation --## 
[13.88, 5.04, 2.23, 20.0, 80.0, 0.98, 0.34, 0.4, 0.68, 4.9, 0.58, 1.33, 415.0, 'Merlot']
##-- Row 1 After Transformation --## 
[1.1253192595117543, 2.845764061669694, -0.45902297004005566, 0.2279555555839831, -1.3787844017876945, -2.2193953800197064, -1.7894974173749854, 0.2644746993271315, -1.4437479611809765, 0.05691646573974076, -1.69107399616703, -2.0510335121401804, -0.9613061482603757, 0.0, 0.0]

As you can see, we have applied all the transformations to our data, both categorical and numerical, in a very simple and parallelized way. Also, like the pipeline, you can save the ColumnTransformer to be able to apply these same transformations to the new data that is coming in.
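
As a minimal sketch (the file name is just an illustrative example), this works exactly like saving the pipeline:

# Persist the fitted ColumnTransformer just like the pipeline
dump(col_transf_fit, 'models/column_transformer.joblib')
col_transf_fit = load('models/column_transformer.joblib')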

Conclusion

As you can see, Sklearn is a super powerful library to perform Machine Learning in Python that will facilitate your work as a Data Scientist throughout the model creation process.

As always, I hope you liked this Sklearn tutorial on how to do Machine Learning in Python. If so, I encourage you to subscribe to keep up to date with the new posts that are coming up. See you next time!