MLflow tutorial: MLOps made easy

Without a doubt, creating models is important. However, as companies put more models into production, it becomes increasingly important to know how to manage those models once they are deployed. This is known as MLOps, and while there are many tools and ways to do it, one of the most widely used is MLflow. The objective of this MLflow tutorial is therefore to teach you how to put MLflow into production, as well as how it works, both at a theoretical and a practical level. Sounds good? Well, let's get to it!

Introduction to MLflow

MLflow is an open source tool to manage the life cycle of machine learning models. To do this, it has three main components:

  • Tracking: records the parameters and results of the models so that they can be compared.
  • Projects: packages the code in such a way that it is reproducible.
  • Models: allows you to manage model versioning, as well as put ML models into production as an endpoint. The latter is a very interesting aspect, since it includes integrations to deploy the model to both Azure ML and AWS SageMaker. In addition, it allows the model to be exported as an Apache Spark UDF.

When using MLflow you do not necessarily have to use all of its capabilities. In my opinion, the Projects section is not that powerful. It is also worth saying that MLflow is not the only way to do MLOps, although it is a very interesting tool in many cases.

As you can see, MLflow covers many of the key issues in model development. In any case, the first step will always be the same: install MLflow on a server so that anyone on the team can use it. Let's see how to do it.

How to install MLflow

To put a server with MLflow into production we will need three things:

  1. A virtual machine. On this virtual machine we will install MLflow, we will be able to see the MLflow UI, it will serve models, etc.
  2. A database. The database is where MLflow will store the tracking metadata (parameters, metrics, etc.). It cannot be just any database: it must be one of the following: MySQL, SQLite or PostgreSQL.
  3. A place to save artifacts: this is where we will keep the models. We could use the virtual machine itself, although the most typical option is a data lake such as S3 or Cloud Storage. In our case, being in the Google Cloud environment, we will use Cloud Storage, since it will make our work easier.

So, let’s continue with the MLflow tutorial, seeing how to install it in a virtual machine:

How to install MLflow on a virtual machine

In order to install MLflow, we must first have a virtual machine running. This can be done on any cloud platform or on your own server, although in my case I will do it on GCP, since it offers a free e2-micro virtual machine per month.

In the case of GCP we must go to Compute Engine and create a new instance, as it appears in the following image:

Create Compute Engine Virtual Machine
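
If you prefer the command line, an equivalent instance can also be created with gcloud. This is just a sketch: the instance name, zone and network tag (mlflow, used later for the firewall rule) are illustrative choices.

# Create a free-tier e2-micro VM with a network tag for the firewall rule
gcloud compute instances create mlflow-server \
    --machine-type=e2-micro \
    --zone=us-central1-a \
    --tags=mlflow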

Once you have created it, access the virtual machine through SSH by clicking on the SSH button. A window will open and you will be inside the virtual machine. First of all, we have to install Miniconda. To do this we execute the following code:

# We install conda
curl -O -J https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

You will have to press Enter until you reach the end of the license agreement and then accept the terms. Finally, you must activate conda:

source .bashrc

Finally, we have to install MLflow. For this there are two options:

  • Option 1: Install MLflow with all the packages it can use, such as Sklearn, Azure, etc. This way, you avoid having to install several packages one by one. To do this, you must execute the following command (in some shells, such as zsh, you may need to quote it as pip install 'mlflow[extras]'):
pip install mlflow[extras]
  • Option 2: Simply install MLflow without any extras, in which case we will have to install the extras we use ourselves. To do this, we simply have to execute the following command: pip install mlflow

In my case, to facilitate the MLflow tutorial I will install it with the extras, that is, option 1.

Now, let's create a firewall rule. This is necessary to be able to access the MLflow UI, as well as to put models into production. To do this, we have to go to the Firewall section within VPC Network (link). At the top we can create a rule, in which we will have to define the following points:

  • Targets: Specified target tags
  • Target tags: mlflow (or the name of the tag you gave it when creating the VM).
  • Source filter: IPV4 ranges
  • Source IP ranges: 0.0.0.0/0
  • Specified protocols and ports: tcp: 8080, 8081, 1234

If you are later going to serve a model on a port (for example, port 1234), remember to add it to the rule so that you can access it.

The rule should look like the following:

Firewall rule for the VM
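
As an alternative to the UI, the same rule can be created with gcloud. A sketch, assuming the rule name mlflow and the network tag given to the VM:

gcloud compute firewall-rules create mlflow \
    --direction=INGRESS \
    --allow=tcp:8080,tcp:8081,tcp:1234 \
    --source-ranges=0.0.0.0/0 \
    --target-tags=mlflow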

Once we have the rule created, we return to the VM console and execute the following command:

mlflow ui -p 8080 -h 0.0.0.0

Now, if we go to the external IP of our virtual machine (it is shown in the Compute Engine UI and has the form xx.xxx.xxx.xxx, where the x's are numbers) and add port 8080, we can access the MLflow UI. The URL will look something like this: 12.345.678.912:8080.

Perfect, we already have MLflow installed on the virtual machine. Now, let’s see how to create a database so that it can be used as a repository for model logs.

How to connect a database to MLflow

The database is necessary to store information such as the parameters used in the model or the metrics obtained by the model.

The first thing we need in order to connect the database to MLflow is, obviously, to create the database. As we have said, I will create the database within the virtual machine itself. To do so, I will execute the following code:

sudo apt-get install postgresql postgresql-contrib postgresql-server-dev-all -y

Now that we have the database, we connect to it:

sudo -u postgres psql

Now that we are connected we will have to create the database where the information will be saved.

CREATE DATABASE mlflow;

In addition, we will have to create a username and password to be able to access the database, and give the user the necessary permissions:

CREATE USER mlflow WITH ENCRYPTED PASSWORD 'your_password';
GRANT ALL PRIVILEGES ON DATABASE mlflow TO mlflow;
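
Note: on recent PostgreSQL versions (15 and later), new users can no longer create tables in the public schema by default, so if MLflow later fails to create its tables you may also need to grant schema privileges from inside the mlflow database:

\c mlflow
GRANT ALL ON SCHEMA public TO mlflow;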

Finally, we have to install the libraries MLflow needs to connect internally to this database to read and write data. Since it is Postgres, we will have to execute the following commands:

sudo apt install gcc -y 
pip install psycopg2

With this we already have the database created. Now, we can continue with the MLflow tutorial seeing how to create the artifact repository.

How to add the artifact repository to MLflow

The artifact repository is where all the files that are not metrics or parameters will be stored, such as trained models, data, images, or any other file that we want to save.

As I mentioned previously in the MLflow tutorial, there are two ways to add an artifact repository: either use the virtual machine itself or use a datalake (S3, Cloud Storage, etc.). In our case we will use Cloud Storage.

To use Cloud Storage with MLflow we have to follow three steps:

  1. Create the bucket in Cloud Storage.
  2. Give access to the Compute Engine service account.
  3. Allow access to Cloud Storage at both the server and the client level (the client needs access to be able to write artifacts, while the server needs to read and serve them).

So, let's go step by step:

How to create a Bucket in Cloud Storage for MLflow

Creating a bucket is very simple. We simply have to go to Cloud Storage, create a bucket, give it a name and choose the type of storage we want.
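
If you prefer to do it from the terminal, the bucket can also be created with gsutil. A sketch, where the bucket name and region are just examples:

gsutil mb -l us-central1 gs://mlflow_artifacts_bucket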

Once we have created the bucket, if we go to Configuration -> Gsutil URI we will see the URI of our bucket, which will have a form like the following: gs://bucket_name

Perfect, that's this point done. Let's continue with the MLflow tutorial by seeing how to give access to the service account.

How to give Compute Engine service account access

First of all, we have to go to the users section within IAM (link). Once there, we copy the email of the Compute Engine default service account, which has the following structure: xxxxxxxxxxxxx-compute@developer.gserviceaccount.com.

Once we have copied it, we return to our bucket and, this time, go to the Permissions section. At the bottom we will see the add button. We click it and paste the service account that we copied. In the roles section, we add the Storage Object Admin role. This way the service account will be able to create and access the artifacts.

Perfect, now it is only necessary for both the server and the client to have access to this bucket. Let’s see how to do it:

Allow access to Cloud Storage to server and client

Since the virtual machine (the server side) is in the same project as the bucket, nothing needs to be done for it. Regarding the client, that is, our computer, in order to access Cloud Storage we first have to download the keys of the Compute Engine service account. To do this, we return to the Service Accounts section within IAM (link), click on the three-dot menu and choose "Create key". Once this is done, a .json file with our service account keys will be downloaded.
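
The same key can also be downloaded from the command line. A sketch, assuming the service account email copied earlier and service_account.json as the output file:

gcloud iam service-accounts keys create service_account.json \
    --iam-account=xxxxxxxxxxxxx-compute@developer.gserviceaccount.com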

We return to our development environment and install the following library to access Cloud Storage:

pip install google-cloud-storage

Finally, we simply have to place the service account keys in the folder of our ML project and execute the following code:

from google.cloud import storage

# Create a Storage client authenticated with the service account key file
credentials = 'service_account.json'
client = storage.Client.from_service_account_json(credentials)

Perfect! We already have our Cloud Storage mounted and ready to be used by our MLflow server. Now, let’s see how to launch the server. Let’s get to it!

How to launch the server with MLflow

To launch the server with MLflow we have to keep three things in mind:

  1. Host and port: the same ones we defined when creating the firewall rule, that is, host 0.0.0.0 and port 8080.
  2. The URI of the database. The format is as follows: postgresql://<user>:<password>@<host>/<database>

In my case, the URI of the database is the following:

postgresql://mlflow:your_password@localhost/mlflow

  3. Artifact repository path. As we have seen previously, this path has the following form: gs://bucket_name.

With these three points, we can launch the MLflow server with the following code:

mlflow server --backend-store-uri <db_URI> --default-artifact-root gs://bucket_name -h <host> -p <port>
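
For example, with the values used throughout this tutorial (the Postgres URI from above and an illustrative bucket name), the command would look something like this:

mlflow server \
    --backend-store-uri postgresql://mlflow:your_password@localhost/mlflow \
    --default-artifact-root gs://mlflow_artifacts_bucket \
    -h 0.0.0.0 -p 8080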

Perfect! We now have the MLflow server up and running. Now let's continue with the MLflow tutorial, seeing how each section works. Let's go!

How to use MLflow Tracking

Important concepts about MLflow Tracking

Saving the tracking of our models in our MLflow server is very simple. We simply have to:

  1. Connect to our remote server with MLflow. To make the connection, we simply have to use the set_tracking_uri method and indicate the IP of our server and the port. Example:
import mlflow
mlflow.set_tracking_uri('http://xx.xxx.xxx.xxx:8080')

With this, we will have made the connection with our MLflow server.

2. Optionally, we can create an experiment in which to log the runs, if it does not already exist. Each different project should have its own experiment. To do this, we will use the create_experiment method:

experiment_name = "experiment_iris"

if not mlflow.get_experiment_by_name(experiment_name):
    mlflow.create_experiment(name=experiment_name)

3. When training the model, log the parameters, metrics, artifacts and the model itself. This is the data we can include in MLflow:

  • Model parameters: the parameters of the model used. They are logged using the log_param method.
  • Metrics: performance metrics, such as RMSE, accuracy, AUC, etc. They are logged using the log_metric method.
  • Artifacts: allows you to include files and/or folders. Typical uses are including the training data, training images, etc. Artifacts are logged using the log_artifact method.
  • Models: allows you to include models. Models are logged using the log_model method. Additionally, the Sklearn, TensorFlow, Keras, Gluon, XGBoost, LightGBM, Statsmodels, Spark, Fastai and PyTorch libraries support autologging. That is, we can simply call the autolog method and MLflow will automatically log the data we generate (see the short sketch after this list). You can learn more about autolog here.
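
As a reference, this is a minimal sketch of what autologging looks like with Sklearn; the tracking URI is the same illustrative one used throughout the tutorial:

import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

mlflow.set_tracking_uri('http://xx.xxx.xxx.xxx:8080')

# Enable autologging for Sklearn: params, metrics and the model are logged automatically
mlflow.sklearn.autolog()

data = load_iris()
with mlflow.start_run():
    RandomForestClassifier(max_depth=6).fit(data['data'], data['target'])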

Important: to be able to log the parameters, we must first tell MLflow to "listen". We can do this in two ways:

  1. Use the start_run method together with a with statement, so that the run is closed automatically when the block ends. Example:
with mlflow.start_run():
   mlflow.log_param('max_depth', max_depth)
   #...

  2. Use the start_run and end_run methods:

mlflow.start_run() 
mlflow.log_param('max_depth', max_depth) 
#...
mlflow.end_run()

With that said, let’s see how it works with a very simple example:

MLflow Tracking example

As an example we are going to create a model that predicts the flower type of the Iris dataset. To do this, we will create a Random Forest model with Sklearn applying Grid Search and Cross Validation and we will log the parameters that work best.

It is also important to remember two key points:

  1. Connect to our Data Lake, in my case, Cloud Storage, so that we can save the artifacts there.
  2. Make the MLflow setup using set_tracking_uri and, optionally create and define the experiment.

In addition, we will log the training data and the resulting model.

If you don’t know any of the Sklearn functions that I use or would like to delve into this library, I recommend that you read this Sklearn tutorial .

# Load libraries
import numpy as np
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.metrics import precision_score, accuracy_score, recall_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Connect to Cloud Storage using the service account keys
import os
from google.cloud import storage

service_account = 'credentials.json'
# MLflow's GCS artifact uploads use the default credentials, which can be set via this env var
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = service_account
client = storage.Client.from_service_account_json(service_account)

# Setup MLflow (point the client at the remote tracking server first)
mlflow.set_tracking_uri('http://XX.XXX.XXX.XXX:8080/')

# If the experiment doesn't exist, I create it
experiment_name = "experiment_iris"

if not mlflow.get_experiment_by_name(experiment_name):
    mlflow.create_experiment(name=experiment_name)

experiment = mlflow.get_experiment_by_name(experiment_name)

# Load the data
data = load_iris()

# Split data in train & test
x_train, x_test, y_train, y_test = train_test_split(
    data['data'],
    data['target'],
    test_size= 0.2,
    random_state= 1234
    )

# Define the model
rf_class = RandomForestClassifier()

# Define hyperparameter grid
grid = {
    'max_depth':[6,8,10], 
    'min_samples_split':[2,3,4,5],
    'min_samples_leaf':[2,3,4,5],
    'max_features': [2,3]
    }

# I do Grid Search
rf_class_grid = GridSearchCV(rf_class, grid, cv = 5) 
rf_class_grid_fit = rf_class_grid.fit(x_train, y_train)

print(f'Best parameters: {rf_class_grid_fit.best_params_}')
Best parameters: {'max_depth': 6, 'max_features': 2, 'min_samples_leaf': 2, 'min_samples_split': 2}

Now that we have the model trained, let’s do the logging in MLflow:

# I log parameters into MLflow
with mlflow.start_run(experiment_id = experiment.experiment_id):

    # I log the best fitting parameters 
    mlflow.log_params(rf_class_grid_fit.best_params_)

    # I get predictions
    y_pred = rf_class_grid_fit.predict(x_test)

    # I calculate accuracy, precision & recall
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    print(f'Accuracy: {accuracy}\nPrecision: {precision}\nRecall: {recall}')

    # I log the metrics
    metrics ={
        'accuracy': accuracy,
        'precision': precision, 
        'recall': recall 
        }

    mlflow.log_metrics(metrics)


    # Log model & artifacts (np.save adds the .npy extension automatically)
    np.save('data/artifacts/x_train', x_train)
    mlflow.log_artifact('data/artifacts/x_train.npy')

    mlflow.sklearn.log_model(rf_class_grid_fit, 'iris_rf_first_attempt')
Accuracy: 1.0
Precision: 1.0
Recall: 1.0

Perfect! We already have our model created and it has been tracked in MLflow. Now, if we go to the UI we will see that the model has three parts.

On the one hand, we have the parameters of our model, in such a way that we know with what parameters that particular model has been trained:

Model parameters in MLflow

Likewise, in the metrics section we can also see what the metrics that we have achieved with this model have been. In my case we have only included final metrics, but if we had trained a neural network, for example, we could have saved other metrics such as the validation accuracy of each iteration. In those cases, from the UI we can see the evolution of the metric.

Model metrics in MLflow

Finally, the artifacts section includes both the artifacts, in my case, the data, and the model. Anything we save other than parameters or metrics will be an artifact.

Artifacts in MLFlow

As you can see, it is very easy to register models and parameters in MLflow. But MLflow goes much further. Let’s continue with this MLflow tutorial, seeing what for me is something key to this software: putting models into production. Let’s get to it!

Putting models into production with MLflow

Once we have a model uploaded to MLflow, we can very easily put it into production as an API. To do this, we go to the "Artifacts" tab of the model that we want to put into production. There, we click on the "Register Model" button, which will open a window like the following one:

Register model in MLflow

We simply have to indicate the name of the model, which, in my case, is iris (the name simply serves to identify the model and keep track of its different versions).
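
For reference, the same registration can also be done from code with mlflow.register_model; a minimal sketch, where the run ID and artifact path are illustrative ones matching this tutorial's example:

import mlflow

# Assumes mlflow.set_tracking_uri(...) has already been called, as in the tracking example
# Register the model logged in a previous run under the name "iris"
result = mlflow.register_model(
    "runs:/72c46af3d3f649569c4df0c7cdfeb263/iris_rf_first_attempt",
    "iris"
)
print(result.name, result.version)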

Now in order to obtain predictions there are two options:

  • Read the artifact from the MLflow server and use it to make predictions.
  • Publish the artifact as an endpoint.

Let’s see how to do each case:

Making predictions with an MLflow model

To obtain the predictions, we simply have to go to the artifacts section, where we will see how to obtain predictions from our model, either using Spark or Python. So, we simply copy that code and execute it, passing the real data that we want to predict:

logged_model = 'gs://mlflow_artifacts_bucket/artifacts/7/72c46af3d3f649569c4df0c7cdfeb263/artifacts/iris_rf_first_attempt'

# Load model as a PyFuncModel.
loaded_model = mlflow.pyfunc.load_model(logged_model)

# Predict on a Pandas DataFrame.
import pandas as pd
loaded_model.predict(pd.DataFrame(x_test))
array([1, 1, 2, 0, 1, 0, 0, 0, 1, 2, 1, 0, 2, 1, 0, 1, 2, 0, 2, 1, 1, 1,
       1, 1, 2, 0, 2, 1, 2, 0])

This way of putting the model into production requires that the client making the requests can use the MLflow API, that is, it must be done from Python, R or Java. If we want to make predictions easily over REST, I would opt for the following method.

Having seen how to put an MLflow model in production, let’s continue with our MLflow tutorial seeing how we can publish a model as an endpoint. Let’s get to it!

How to publish an MLflow model as an endpoint

In order to put an MLflow model into production as an endpoint, we can do one of three things:

  1. Deploy to the MLflow server itself.
  2. Deploy to an external tool, such as AWS SageMaker or Azure ML, or as an Apache Spark UDF.
  3. Download the model as Docker and put it into production in any other tool.

In my case I will use the first option, since it is the simplest and works well for models with moderate traffic. However, for models with much more traffic, I would recommend one of the other two options (see the Docker sketch below for the third one).
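
For the third option, MLflow can package the model as a Docker image, which can then be deployed anywhere that runs containers. A sketch, where the image name is an example:

# Build a Docker image containing the model and its environment
mlflow models build-docker -m "<model_uri>" -n iris-model
# The image serves the model on port 8080 inside the container
docker run -p 1234:8080 iris-model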

That said, to put an MLflow model into production as an endpoint, we simply have to copy the model’s URI, which is found at the top of Artifacts, next to Full Path and run the following code:

mlflow models serve -m "<model_uri>" -p <port> -h <host> --no-conda

It is important that the endpoint port is different from the port the UI is on. In my case, I have used port 1234.
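
With the values from this tutorial (the model URI copied from the Artifacts tab, as shown in the prediction example above), the command would look something like this:

mlflow models serve \
    -m "gs://mlflow_artifacts_bucket/artifacts/7/72c46af3d3f649569c4df0c7cdfeb263/artifacts/iris_rf_first_attempt" \
    -p 1234 -h 0.0.0.0 --no-conda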

Once this is done, MLflow will display the model on the port that we have indicated and now we will simply have to make POST requests to the invocations endpoint on the indicated port:

import requests
import pandas as pd

url = 'http://XX.XXX.XXX.XXX:1234/invocations'

headers = {'Content-type': 'application/json'}
data = pd.DataFrame(x_test).to_json(orient='split')

resp = requests.post(
    url,
    headers=headers,
    data = data
    )

resp.content
b'[1, 1, 2, 0, 1, 0, 0, 0, 1, 2, 1, 0, 2, 1, 0, 1, 2, 0, 2, 1, 1, 1, 1, 1, 2, 0, 2, 1, 2, 0]'
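
Note that the split-oriented JSON body above follows the MLflow 1.x scoring protocol. If the server runs MLflow 2.x, the payload must be wrapped in one of the accepted keys (for example dataframe_split); a sketch of the equivalent request, reusing the url, headers and x_test variables from above:

import json

data = json.dumps({"dataframe_split": pd.DataFrame(x_test).to_dict(orient="split")})
resp = requests.post(url, headers=headers, data=data)
print(resp.content)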

As we can see, now we have the model in production in API mode and we can access it very easily through HTTP requests.

Conclusion

Without a doubt, MLflow is an incredible tool for MLOps and for putting models into production. It is very good not only for recording models, their parameters, metrics and input data, but also for putting those models into production.

Besides, it is an ecosystem-agnostic platform: you can implement it using different databases and different types of file systems, and you can put the models into production on the server itself or, very easily, on AWS and Azure.

I believe that the hardest part of MLflow is putting it into production and getting all the pieces to fit together and work. In any case, I hope this MLflow tutorial has helped you learn how to put MLflow into production and how to use its Tracking and Deployment features.

If so, and you want to keep up to date with more content like this, I encourage you to subscribe so you don't miss the new posts I upload. See you in the next post!