PyCaret: low-code machine learning in Python

PyCaret is one of the most popular Python libraries for developing Machine Learning models. With PyCaret you can do many things, such as:

  • Apply imputation of missing values, scaling, feature engineering or feature selection in a very simple way, just by indicating a few parameters.
  • Train more than 100 machine learning models of all types (classification, regression, forecasting) with a single line of code.
  • Register the trained models in MLflow in a very simple way.
  • Create an API or Docker image to put the model into production.
  • Upload your model to the cloud to speed up deployment to production.

As you can see, PyCaret is useful for many things. For this reason, I consider it to be a tool that every Data Scientist should know. Sound good to you? Let’s go with it!

Installing Pycaret

First of all, to install PyCaret you must keep two things in mind:

  1. PyCaret only works with Python versions up to 3.8. So if your Python version is higher than 3.8, either downgrade Python or try PyCaret in Google Colab or Docker. You can check your Python version with the following command:
python --version

If you don’t know how Docker works, you can learn more about it in this post.

  2. PyCaret uses version 0.23.2 of Scikit-Learn. Therefore, if you plan to use a newer version of Scikit-Learn in the same project, I would recommend using separate virtual environments.

With that said, let’s install Pycaret:

Note: It is recommended to use virtual environments. However, in this post I am not going to explain how to create/activate a virtual environment. You can learn more about it here.

pip install pycaret

Perfect, we now have PyCaret installed. Let’s see how to use it. Let’s go with it!

How to train models in Pycaret

Pycaret has several modules, each of them specialized in different types of Machine Learning:

  • Supervised Models:
    • Regression or time series: pycaret.regression
    • Classification: pycaret.classification
  • Unsupervised Models:
    • Clustering: pycaret.clustering
    • Anomaly detection: pycaret.anomaly
    • Association Rules: pycaret.arules
    • Topic Modelling: pycaret.nlp

As you can see, PyCaret has many different modules. Although this may seem complex, the workflow is always the same in every case:

1. Setup definition

This is the most important step for making good predictions. Here, you define issues such as:

  • Data preprocessing, including normalization, standardization, feature selection, feature engineering, feature generation, etc.
  • Training strategy: the type of validation strategy to apply, the number of folds, the number of CPUs to use in training, whether or not to use a GPU, etc.
  • Other questions, such as whether to log the results obtained or the name of the experiment.

This whole process is done with the setup function.
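To make this more concrete, here is a minimal sketch of a setup call combining preprocessing and training-strategy parameters. The 'juice' sample dataset, the target name and all parameter values are illustrative choices, not taken from this post's examples:

from pycaret.classification import setup
from pycaret.datasets import get_data

# Minimal sketch (PyCaret 2.x); the dataset, target and values are illustrative
exp = setup(
    data = get_data('juice'),        # sample dataset shipped with PyCaret
    target = 'Purchase',
    normalize = True,                # preprocessing
    fold = 5,                        # validation strategy: 5-fold cross-validation
    n_jobs = -1,                     # use all available CPUs
    use_gpu = False,
    log_experiment = True,           # log the runs to MLflow
    experiment_name = 'juice_demo',
    silent = True                    # do not ask to confirm the inferred data types
)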

2. Model training

In this case there are two approaches available:

  1. Train many different models and see how well each of them performs. You can do this with the function compare_models.
  2. Train a single model of our choice. You can do this with the create_model function, to which you must pass one of the identifiers that appear in the output of the models() function.

Personally, the approach I usually take is the following: on the first pass over a project, I train many different models to see how each of them behaves.

Once you have identified the models that usually work best, you can focus only on those models.

For time series models that are automatically retrained as part of an MLOps pipeline, if retraining time is not an issue, it is often worth training many different models every time.

Also, the compare_models function has a series of parameters that help a lot in training (see the sketch after this list), such as:

  • include: the ids of the models you want to train.
  • exclude: the ids of the models that you do not want to be trained.
  • n_select: by default compare_models returns only the model that works best. However, if n_select > 1, it returns a list with the top n best models. This is interesting if we want, for example, to build an ensemble model.
  • budget_time: the maximum time (in minutes) that the function will run. It is very useful if you need the training to last less than a given amount of time.
  • parallel: allows you to run the training on distributed systems with Spark or Dask.
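
For example, a hedged sketch of how these parameters fit together (it assumes setup() has already been run; the model ids and the time budget are illustrative):

from pycaret.classification import compare_models

# Compare only a few tree-based models, keep the 3 best ones (sorted by AUC)
# and stop training after roughly 10 minutes
top3 = compare_models(
    include = ['rf', 'et', 'lightgbm'],
    n_select = 3,
    sort = 'AUC',
    budget_time = 10   # in minutes
)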

For its part, the create_model function has some interesting parameters, such as the following:

  • probability_threshold: the probability threshold above which an observation is assigned to the positive class (default is 0.5).
  • Since it accepts kwargs, the function lets you pass values for the hyperparameters of the estimator being trained. For example, if we want to train a Random Forest Classifier (id = rf) and always use max_depth = 3, we could do the following (using one of PyCaret's sample datasets for illustration):
from pycaret.datasets import get_data
from pycaret.classification import setup, create_model

# setup needs a dataset and a target; the 'juice' sample dataset is just an example
setup(data = get_data('juice'), target = 'Purchase', silent = True)
create_model('rf', max_depth = 3)

3. Understanding and evaluating the model

With the previous step we would have already selected one or several models. However, we surely want to see how those models work (which variables are most relevant, learning curves, etc.). We can do all this in a very simple way in PyCaret thanks to the plot_model function.

More specifically, with the plot parameter of the plot_model function we can obtain different types of graphs.

PyCaret’s plot_model function is very extensive. In this post I will only cover the most common plots. If you want to learn everything that this function offers, I recommend that you read this page.

Likewise, the evaluate_model function opens a widget where you can choose each of the different types of graphs, in order to evaluate the model in a simple and interactive way.

In addition, the dashboard function creates an interactive dashboard based on Explainer Dashboard.

Finally, the deep_check function uses the deepchecks library to check whether there are problems in the training process, such as data leakage.
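
As a quick sketch of how these evaluation helpers are called (assuming model holds an estimator returned by create_model or compare_models):

from pycaret.classification import evaluate_model, dashboard, deep_check

evaluate_model(model)   # interactive widget to browse the available plots
dashboard(model)        # launches an Explainer Dashboard in the browser
deep_check(model)       # runs the deepchecks suite (data leakage checks, etc.)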

4. Save the model for deployment

Once we have a better understanding of how and why the model works, we can put it into production.

In this sense, PyCaret helps us to do two types of deployment:

  1. Manual deployments: this consists of saving the model locally so that we can manually put it into production. In this process, the most typical thing is to create an API and Dockerize the service.

If you don’t know how to put a Python model into production, in this post I explain how to do it in Google Cloud by creating an API and using Docker.

  2. Cloud deployments: a set of functions that make it easy to upload your model to AWS, Azure and GCP. Currently it only allows uploading the model to the data lake (S3, Cloud Storage, Azure Storage); the subsequent deployment from the data lake to a service would be the job of the ML Engineers (a sketch follows below).
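
For reference, a hedged sketch of how such an upload looks with PyCaret's deploy_model function (the bucket name is a placeholder and your cloud credentials are assumed to be configured):

from pycaret.classification import deploy_model

# Uploads the trained pipeline to an S3 bucket; GCP and Azure work analogously
deploy_model(
    model,
    model_name = 'my-model',
    platform = 'aws',
    authentication = {'bucket': 'my-s3-bucket'}
)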

Functions for manual deployment

In this process, the first thing to do is save the model.

To save a model trained with PyCaret we can use the function save_model, which saves both the trained model and the data preprocessing pipeline as a .pickle.

On the other hand, to be able to load the previously saved model, you can use the function load_model.

Likewise, if the production environment is not Python (C, Java, Go, C#, etc.), we can convert the decision logic of our model to these languages using the convert_model function.
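
A small sketch of what that conversion could look like (the target language and the model variable are illustrative):

from pycaret.classification import convert_model

# Returns the model's decision logic as source code in the chosen language
java_source = convert_model(model, language = 'java')
print(java_source)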

However, if we are going to use Python to create an API and then put it into production, PyCaret includes the function create_api, which creates an API based on FastAPI where our model is exposed to POST requests.

If you don’t know how to create APIs in Python, in this post I explain how to create APIs with Flask and FastAPI.

Also, if you want to Dockerize the API, PyCaret has the create_docker function, which Dockerizes the previously created API. In this case, PyCaret will create both the Dockerfile and the requirements.txt files.

If you want to learn how Docker works more in depth, I recommend you read this post.

Now that we know all the ingredients that make up PyCaret, let’s see how to use it in four different cases: regression, classification, time series and clustering models. Let’s go with it!

How to create a regression model with PyCaret

Loading the data

To create the regression model we are going to use the California home price data, which is available in Sklearn’s datasets module.

In short, it is a real dataset in which we have information about each house (number of bedrooms, age of the house, location) and its area (population and income), and with this we have to predict the price of the houses.

Let’s see what the dataset looks like:

from sklearn.datasets import fetch_california_housing

california_housing = fetch_california_housing(as_frame=True)
california_housing.frame.head()

Setup Definition

Perfect, now that we have the dataset, we are going to import the necessary PyCaret functions. Normally you would import everything at once, but for clarity I will import each function separately:

from pycaret.regression import setup, compare_models, create_model, tune_model, \
  plot_model, save_model

california_housing_setup = setup(
    data = california_housing.frame,
    target = 'MedHouseVal',
    normalize = True,
    transformation = True,
    remove_multicollinearity = True,
    multicollinearity_threshold = 0.8,
    feature_selection = True, 
    ignore_low_variance = True,
    remove_outliers = True,
    imputation_type = 'simple',
    numeric_imputation = 'median',
    silent = True
)

Model Comparison

Now that we have defined the setup, we are going to follow the following strategy:

  1. Train many models to see which ones work best.
  2. Once we know which model works best, do a specific tuning of that model.

Let’s go with the first step: we are going to compare many models:

best_model = compare_models()

As we can see, 18 different models have been trained, each of them evaluated with 10-fold cross-validation.

In addition, the Light Gradient Boosting Machine model is the one that works best (MAE = 0.3905) and performs significantly better than a very simple model such as the Dummy Regressor (MAE = 0.8989).
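
If you also want the full comparison table as a DataFrame (for example, to log it somewhere else), PyCaret exposes the last printed score grid through the pull function:

from pycaret.regression import pull

# pull() returns the score grid of the last executed command as a pandas DataFrame
comparison_results = pull()
comparison_results.head()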

Hyperparameter Tuning

Now that we know this, let’s tune our model more precisely; the default hyperparameters that PyCaret uses when comparing several models usually leave plenty of room for improvement:

lightgbm_model = create_model('lightgbm')
tuned_lightgbm = tune_model(
    lightgbm_model,
    optimize = 'MAE',
    search_library = 'scikit-optimize',
    n_iter = 50
)

As you can see, thanks to the tune_model function we can perform a much more precise tuning and obtain a better model.

In fact, the model has gone from a test MAE of 0.3905 to a MAE of 0.3809, which is a considerable gain, and all of that just by applying a single function.

Model understanding and validation

Finally, let’s see how our model is behaving. In addition, we will use deepchecks to check that there has been no data leakage problem.

First of all, we are going to check the learning process with a learning curve plot.

In the current version of PyCaret (2.3.10) the plot_model function does not return the plot object, so subplots cannot be created. However, this is planned for future versions.

plot_model(best_model, plot = 'learning')
Training Plot PyCaret

As we can see, as the model sees more training data, its predictive capacity on new data (cross-validation) improves while its score on the training set decreases. Therefore, it seems that the model is able to generalize correctly.

Now, let’s see how the residuals behave in train and test:

plot_model(best_model, plot = 'residuals')
Residual Plot PyCaret

As we can see, it seems that the residuals in both train and test follow a normal distribution. Furthermore, in both cases the R2 values are quite high, so the model seems to be well fitted.

So, finally we are going to save the model to put it into production.
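
A minimal sketch of that step (save_model was already imported from pycaret.regression above; the file name is just an example):

# Saves the tuned model together with its preprocessing pipeline as a .pkl file
save_model(tuned_lightgbm, 'lightgbm_california')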

Create a Dockerized API with PyCaret

To put the model into production I am going to expose it through an API and then Dockerize it so that it can be deployed, for example, on Kubernetes.

To create the API I will use the create_api function, and later, to create the Docker image, I will use the create_docker function.

from pycaret.regression import create_api, create_docker

create_api(best_model, 'lightgbm')
create_docker('lightgbm')

If we look at the project folder, we will see that two new files have been created: Dockerfile and requirements.txt. Let’s see what each of them contains:

# requirements.txt
pycaret
fastapi
uvicorn
# Dockerfile
FROM python:3.8-slim
WORKDIR /app
ADD . /app
RUN apt-get update && apt-get install -y libgomp1
RUN pip install -r requirements.txt
EXPOSE 8000
CMD ["python", "lightgbm.py"]    

With this we would already have our regression model ready to put it into production. As you can see, with PyCaret we have been able to carry out a Machine Learning project in a very simple and fast way.

Now, let’s see how to use PyCaret for a Classification project. Let’s go with it!

How to create a classification model with PyCaret

Data upload

To see how PyCaret works in a classification project, we first need some data on which to train the model.

In this case, we are going to work with the Breast Cancer dataset, which includes different measurements of breast tumors and indicates whether each tumor is benign or malignant. This dataset is available within Sklearn’s datasets module:

from sklearn.datasets import load_breast_cancer
breast_cancer = load_breast_cancer(as_frame=True)
breast_cancer.frame.head()

As we can see in the correlation heatmap below, this dataset has many variables with a high correlation between them. Ultimately, this happens for two reasons:

  1. Some metrics are the result of transformations of other metrics.
  2. For each metric, the mean, standard deviation and worst value obtained are included, and these three values tend to be correlated.

Let’s see it:

import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams["figure.figsize"] = (20, 10) 

sns.heatmap(breast_cancer.frame.corr(), vmin = -1, vmax = 1, annot = True)

Knowing this about our data, let’s see if PyCaret is able to detect the correlation and keep only the less correlated variables.

Definition of the setup for classification

Since this is a classification problem, the main PyCaret functions, such as setup, compare_models, tune_model, etc., are inside the classification module.

On the other hand, since we have multicollinearity problems, we are going to tell PyCaret to take care of solving them.

To do this we will do the following:

  • To remove the correlation we are going to set the parameter remove_multicollinearity to True. In addition, with the parameter multicollinearity_threshold we can indicate the level of multicollinearity above which variables will be removed. By default the value is 0.9.
  • To eliminate the remaining multicollinearity, we are going to apply PCA. To do this, we will set the parameter pca = True. Likewise, we must indicate either the percentage of variance that we want to explain or the number of components that we want to keep. We can do this with the parameter pca_components.

Applying PCA to eliminate multicollinearity is not an option when we need to be able to interpret the model. In those cases we will have to remove variables manually, for example by applying the Variance Inflation Factor (VIF), as sketched below.
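
As an illustration of that manual route, here is a small sketch of a VIF calculation using statsmodels (which is not part of PyCaret and is only used here as an example):

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Compute the VIF of each feature; values above ~10 usually indicate strong multicollinearity
X = breast_cancer.frame.drop(columns = 'target')
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index = X.columns
)
vif.sort_values(ascending = False).head()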

So, let’s create the setup for our breast cancer classification project with Pycaret:

from pycaret.classification import setup, compare_models, tune_model, plot_model,\
  save_model, load_model

breast_cancer_setup = setup(
    data = breast_cancer.frame,
    target = 'target',
    normalize = True,
    transformation = True,
    pca = True, 
    pca_components = 0.8,
    remove_multicollinearity = True,
    multicollinearity_threshold = 0.8,
    ignore_low_variance = True,
    remove_outliers = True,
    imputation_type = 'simple',
    numeric_imputation = 'median',
    silent = True
)

Finally, let’s see how our transformed data looks:

from pycaret.classification import get_config

prep_pipe = get_config('prep_pipe') 
prep_pipe.transform(breast_cancer.frame) 

As we can see, we have reduced the dataset and we are left with only 5 principal components. Now let’s see how the training of the models works.

Classification model training with PyCaret

As in the case of regression problems, we can use the compare_models function to train many different models.

Specifically, PyCaret trains 14 different models, from linear models (Logistic Regression, Linear Discriminant, Ridge) to models based on decision trees (Decision Tree, Random Forest, AdaBoost, Gradient Boosting, etc.) and other types of models such as SVMs, KNN or Naive Bayes.
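
If you want to see the full list of available estimators and their ids, you can use the models function:

from pycaret.classification import models

# Returns a DataFrame with every available model id and its name/reference
models()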

So, we are going to keep the 3 models that work best:

models = compare_models(sort = 'AUC', n_select = 3)

If we inspect the models object, we will see that it is a list with 3 elements, each of them a trained model:

models
[LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                    intercept_scaling=1, l1_ratio=None, max_iter=1000,
                    multi_class='auto', n_jobs=None, penalty='l2',
                    random_state=7632, solver='lbfgs', tol=0.0001, verbose=0,
                    warm_start=False),
 QuadraticDiscriminantAnalysis(priors=None, reg_param=0.0,
                               store_covariance=False, tol=0.0001),
 LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=None,
                            solver='svd', store_covariance=False, tol=0.0001)]

This time, since the best-performing models are relatively simple, we are not going to tune them any further. Instead, we are going to focus on evaluating their performance.

Evaluation of the performance of the models

As I indicated in the theoretical part, one of PyCaret’s strengths is that it greatly facilitates not only the training but also the evaluation of models.

In this sense, one of the interesting evaluations is the one provided by deepchecks, which generates a report with several checks and graphs indicating which ones the model passes and which ones it does not:

from pycaret.classification import deep_check
deep_check(models[0])

As we can see, the model is perfectly calibrated, despite the fact that in some segments the predictive capacity of the model is 75%.

With this we could now save the model to make predictions, as we have done in the case of regression.
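
As a hedged sketch of that final step (the file name is illustrative, and for simplicity the predictions are made on the same dataframe we trained on):

from pycaret.classification import save_model, load_model, predict_model

# Persist the best model together with its preprocessing pipeline
save_model(models[0], 'breast_cancer_model')

# Later, reload the pipeline and generate predictions (adds Label and Score columns)
pipeline = load_model('breast_cancer_model')
predictions = predict_model(pipeline, data = breast_cancer.frame)
predictions.head()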

How and when to use PyCaret

In my opinion, PyCaret is a very interesting library for generating low-code Machine Learning models for several reasons:

  • It allows you to train many different models very easily, taking much less time than doing it manually in Sklearn. This applies to both regression and classification models.
  • It allows you to perform data preprocessing in a simple way, just by indicating different parameters in the setup function.
  • It includes many functions and integrations with other libraries to interpret the models in a very simple way.
  • It helps you create the API and Dockerfile needed to put the model into production.

Although PyCaret is very powerful, in my opinion it is not ideal in all circumstances.

Personally, I find PyCaret very interesting as a first approach in a Data Science project that requires Machine Learning.

The reason is simple: PyCaret allows you to train many different models with little effort, so you can quickly find out which models are likely to work best.

However, in order to fine-tune and improve the predictive capacity of the model, I usually choose to perform manual optimizations of the hyperparameters, as well as other options such as stacking models (if the project allows it).

Although PyCaret allows hyperparameter optimization using other libraries such as scikit-optimize, it does not provide many advantages in this regard. Likewise, it is not really designed for training stacked models either.

In addition, the current version of PyCaret is quite inflexible in certain matters, such as the logs.log file, which is always generated in the root folder of the project. This adds extra steps when using PyCaret in MLOps processes on tools like Cloud Functions.

In any case, PyCaret is a great tool that, personally, I think every Data Scientist or Machine Learning Engineer should know about.

I hope that this post has helped you to learn more about PyCaret and that it will be useful in your day to day when creating Machine Learning models.