DVC: How to Create a Data Version Control System for MLOps

Creating a data versioning system is essential when we want to implement our MLOps processes. After all, when we talk about Machine Learning, reproducing the same results requires more than knowing which model was used and which parameters it had: we also need to know which data the model was trained with.

The thing is, when we talk about MLOps and putting Machine Learning models into production, it is essential to know why the model in production has made its predictions, either for legal reasons or simply to understand the cases in which the model fails so that we can fix them. And, of course, to understand how it makes its predictions, we need to inspect the model and see what the data it was trained with looked like.

So, in this post we are going to learn how to use one of the most widely used data versioning tools in the world of Machine Learning: DVC. More specifically, we will see how DVC works, how to install it and how to use it as a data version control system for our MLOps processes. Sounds interesting? Let's see how DVC creates a data versioning system!

How DVC works

DVC, as its name (Data Version Control) indicates, is a data version control tool. Version control? Like Git? Yes, indeed: the idea of DVC is to create a version control system for our data that works just like Git.

If you don’t know how Git works, I recommend you read this post where I explain what it is and how to use it in R. Without a doubt, Github is essential if you develop data science projects.

You may be wondering: why use DVC and not Github directly? The main reason is that Github does not let you store files larger than 100 MB. This may not be a problem if you develop software, but if you train machine learning models, your datasets will most likely be larger than 100 MB.

In contrast, DVC has no limit on the size of the data it can manage, for a simple reason: DVC does not store the data on its own server. Instead, with DVC you store the data in your preferred storage service, such as Google Cloud Storage, Amazon S3, Azure Blob Storage, Google Drive, SFTP, etc. For its part, DVC automatically saves a very lightweight file that records where your data has been saved (location, name, etc.).

This way, when you train a model you can refer to the data it was trained on. Since DVC saves this reference to the location where the data is stored, you know what data the model has been trained with.

On the other hand, in addition to allowing for data versioning in a simple way and being agnostic to the platform where the data is stored, DVC also offers a series of very interesting features:

  • It is agnostic to the language we are using. In other words, it works whether we use Python, R, Julia or any other language. This makes its adoption within organizations much easier.
  • It allows the creation of simple data pipelines, in such a way that everyone can visualize the pipeline in the form of a DAG and, of course, also reproduce it. In addition, each step of the data pipeline is only executed if one of its elements (data or files) has been modified, which makes execution much more efficient (we will look at this later).

With that said, let’s see how to get our data versioning system up and running with DVC. Let’s get to it!

Getting started with DVC

DVC Installation

First of all, you will need to download and install DVC, which you can do from this page. Once installed, you can run the following command in the console:

dvc version

If this returns version information like the following, then DVC is installed correctly:

DVC Install
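Alternatively, if you already work in a Python environment, DVC can also be installed with pip. A minimal sketch, assuming you want the optional extras for the remote storage you plan to use (the extras shown are just examples):

pip install dvc
# or, with the extra dependencies for a specific remote:
pip install "dvc[gs]"      # Google Cloud Storage
pip install "dvc[s3]"      # AWS S3
pip install "dvc[azure]"   # Azure Blob Storage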

Using DVC for the first time

Now that we have DVC installed, let's initialize our repository with Github. If you don't know Github in depth, I recommend this post if you are an R user and, if you use Python or any other language, I recommend this post.

We simply initialize our repository with a git init. After that, we have to initialize DVC by executing the following command:

dvc init
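If it helps to see the whole initialization together, here is a minimal sketch (the commit message is just an example); dvc init generates a small .dvc/ directory and, in recent versions, a .dvcignore file, both of which are meant to be committed to Git:

git init
dvc init
git add .dvc .dvcignore
git commit -m "Initialize DVC"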

As you can see, the DVC commands are very similar to those of git. In fact, next we are going to:

  1. Create a folder called data in which to store our data (this is not mandatory, but it is good practice).
  2. Download a file from the internet and save it in our data folder.
  3. Add this file to DVC.
mkdir data
curl -o data/data.csv https://raw.githubusercontent.com/anderfernandez/Donostia-empty-parking-lots-forecast/main/data/data.csv
dvc add data/data.csv

As you can see, the command to add files to DVC is exactly the same as for Git. Once we have executed the previous steps, DVC will also print a confirmation message in the console.

If we go to the data folder, we will see that, in addition to the data.csv file we just downloaded, there is also a .gitignore file and another one called data.csv.dvc.

On the one hand, the .gitignore file ensures that every file we add to DVC is automatically excluded from being committed to Github, since such files can be too large for Github (which is why we use DVC in the first place). Ultimately, DVC saves us from having to set up this exclusion ourselves, which is convenient.

On the other hand, data.csv.dvc is a small file that points to our source file data.csv and includes some metadata about it, such as a hash of its contents. This is the file we upload to Github, since it weighs very little, and it is the file DVC uses to identify the data the model has been trained with.
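As an illustration, the contents of a .dvc file look roughly like this (a sketch; the exact fields can vary between DVC versions and the hash and size below are made up):

outs:
- md5: 2f54e3d8c1a0b9e7d6c5b4a392817f01
  size: 1048576
  path: data.csv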

Perfect, now you know how to add files to DVC and which files will be uploaded to our Git tool. However, where do we actually store the data? Let's see!

How to store data with DVC

As I mentioned in the introduction, with DVC the data is not stored in Git, but is stored in the storage tool that we want: AWS S3, Azure Blob Storage, Google Cloud Storage, Google Drive, FTP, etc.

In order for data to be saved to this location, we must first add a remote to DVC. The remote in DVC is where the data that has been added to DVC will be stored. One of the very positive aspects of DVC is that it allows you to add many platforms as remote: S3, Cloud Storage, Azure Blob Storage, Google Drive, etc.

In the following sections we will see, step by step, how to include each of the different options that we have previously commented on as a remote.

To use DVC you only need to add one of the following options as a remote, not all of them. I explain all of them so that you know how each one works and can choose the one that best fits your case.

How to store data with DVC in Google Cloud Storage

To add data to Cloud Storage we need to have the Google Cloud command line (CLI) downloaded and installed on our computer. You can download it from this page. Once you have it downloaded, you will have to connect to your account, which you can do by executing the following command:

gcloud auth application-default login

Note 1: If you usually work with several Google Cloud accounts on the same computer, as I do (work and personal), I recommend running gcloud auth login instead of gcloud auth application-default login. The reason is that the latter saves the credentials on the computer and uses them by default, which can cause problems later when switching between accounts.

Note 2: if your MLOps process is going to be retrained automatically, you will not be able to use this Google Cloud login flow, as it requires human interaction. In that case, you should create a service account and pass its location with the --cred-file parameter when running the login. Example: gcloud auth login --cred-file=/path/to/workload/configuration/file. This post explains how to create a service account in Google Cloud.

Perfect, now that you have the Google Cloud CLI installed, we are going to create our bucket to store the data. Ideally, each project we work on is its own project in Google Cloud and, in addition, has its own bucket where the data is stored. To create a bucket in Google Cloud, go to this link and click the “Create Bucket” button at the top left, as shown below:

When creating the bucket, we will have to give it a name. With that name, we can add the bucket as a remote to DVC with the following command:

dvc remote add -d myremote gs://mybucket/path
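The -d flag marks this remote as the default, so a plain dvc push will send the data there. If you want to double-check what has been configured, a quick sketch:

dvc remote list
# the remote configuration is stored in .dvc/config, which you can commit to Git
cat .dvc/config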

Now that we've seen how to configure Google Cloud Storage as a remote, let's see how to do the same with AWS S3.

How to store data with DVC in AWS S3

In the case of AWS, the procedure to follow is exactly the same as in the case of Google Cloud. First of all, we need to download the AWS CLI, which you can do from this link.

Likewise, in order to log in to the AWS CLI, we must first create a user that has access to the S3 buckets, which can be done from this link. Once the user is created we will have two pieces of information: the Access Key ID and the Secret Access Key.

So, having this information, we are going to execute the following command in our computer’s console:

aws configure

When doing so, it will prompt us for our credentials: first the Access Key ID and then the Secret Access Key. It will also ask for other settings, such as the region, which will depend on where you live. At this link you can see the regions that AWS offers.

Once logged in, we will have to create our storage bucket in S3 (see image below) and give it a name; in my case, dvc-bucket-ander.

Finally, we simply need to add the bucket we just created as a remote to our DVC repository. To do this we must execute the following command:

dvc remote add -d aws_remote s3://dvc-bucket-ander

With this, every time we tell DVC to save the information remotely (which we will see in the next section), it will automatically be saved in our S3 bucket.
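As a side note, if you cannot rely on aws configure (for example, in an automated job), DVC can also take the credentials directly in the remote configuration. A sketch with placeholder values; the --local flag keeps them out of the versioned config file:

dvc remote modify --local aws_remote access_key_id 'YOUR_ACCESS_KEY_ID'
dvc remote modify --local aws_remote secret_access_key 'YOUR_SECRET_ACCESS_KEY'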

How to store data with DVC in Azure

In order to store the data in Azure Blob Storage, we must first have the Azure CLI installed, which we can download from this page.

Once downloaded and installed, we must log in to our Azure account, which we can do with the following command:

az login

By doing so, a window will open where we will have to log in with our Azure credentials. Once this is done, our Azure account will be linked to the CLI.

Now, we only have to create a Blob Storage container (bucket) in which to store the DVC data. To do this, first of all we need to create a storage account, which you can do by following these steps. In my case, the storage account is called anderstorage.

Once you have created the account, to create the Blob Storage container you must:

  1. Go to your Azure storage account.
  2. On the left side, click on “containers” (see image)

There you can create a new container (or bucket). When doing so, you will give the container a name, which in my case is dvc-example-ander.

So, once you've done this, to add Azure as our DVC remote, you simply run the following commands:

dvc remote add -d azure_remote azure://dvc-example-ander
dvc remote modify azure_remote account_name 'anderstorage'

Once this is done, every time you tell DVC to push the data to the remote repository, it will automatically be uploaded to your repository in Azure Blob Storage.

Other places to store data with DVC

In addition to the places previously mentioned, DVC allows you to save the data in other locations, such as Google Drive, an FTP, HDFS or via SSH.

Personally, I think that, although Google Drive is an option, since storing the data in Google Drive requires a Google Cloud account anyway, it is more interesting to store the data in Cloud Storage. In any case, if you want to save the data to Google Drive, this page explains how to do it.

Perfect, we now know how to add a remote so that the information can be stored remotely. However, how do we actually upload the information to the remote? Let's see.

How to upload the information to the remote repository in DVC

If you have followed the steps above you will have a file called data.csv. To upload this file to our remote repository (whatever it is) we simply have to execute the following command:

dvc push

After that, if we have correctly configured our remote repository, our file will automatically be uploaded to it. As you can see, the operation is exactly the same as with a Git tool.
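In practice, the data-versioning loop usually combines DVC and Git, since the lightweight .dvc file is what Git tracks. A minimal sketch of a typical sequence (the commit message is just an example):

dvc add data/data.csv
git add data/data.csv.dvc data/.gitignore
git commit -m "Track data.csv with DVC"
dvc push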

In addition, we can also download the contents that are already in our remote repository using the following command:

dvc pull 
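And, because the .dvc files live in Git, going back to an older version of the data is usually a matter of checking out the corresponding .dvc file and letting DVC restore the matching data. A quick sketch, where the commit hash is made up:

git checkout a1b2c3d -- data/data.csv.dvc
dvc checkout data/data.csv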

Great, now we know how DVC works in its most basic form. However, there is much more: DVC also helps us with several important parts of our MLOps process, such as capturing the data pipeline or saving the metrics of our model. Let's see.

Creating Data Pipelines with DVC

When we work on a Machine Learning project, we usually don’t have a single script that does the whole process of data extraction, cleaning, training and evaluation. Instead, each of these steps usually goes in one or more separate files.

The problem comes when a new person joins the project, since before starting to work they need to know which processes are run and in which order. This is usually included in the repository documentation, although it can be made even easier.

With DVC we can create a DAG (a directed acyclic graph), so that we can see, in a very visual way, what our data pipeline looks like. Furthermore, since each step of the pipeline is associated with files, DVC knows which of those files have changed, so when you execute the pipeline, only the steps whose inputs have changed are run. This is very interesting, since it reduces execution time.

Let’s see how to create a data pipeline with DVC. For this we are going to use a real case of predicting the number of free parking spaces in Donostia.

Model Training

To train the model, you must first extract the data. The ETL process is already done through Github Actions, so the data is updated automatically in this repository (there may be cases where the source API does not return data or returns wrong data, which will affect the metrics we see in the model results).

If you don't know about Github Actions, I recommend you take a look, as it is a very useful tool. You can learn more about it in this post.

So, first we have a file called extract_data.py with the following content:

import pandas as pd

# Download the raw parking data and save it locally for the rest of the pipeline
url = 'https://raw.githubusercontent.com/anderfernandez/Donostia-empty-parking-lots-forecast/main/data/data.csv'
data = pd.read_csv(url)
data.to_csv('data/raw.csv', index=False)

After extracting the raw data, we are going to filter it: in our case, we will keep only one parking lot. We do this with the script prepare.py:

import pandas as pd
import yaml

params = yaml.safe_load(open("params.yaml"))["prepare"]

# Read the raw data
data = pd.read_csv('data/raw.csv')

# Parameter: which parking lot to keep
parking = params['parking']

# Filter data
data = data.loc[data['properties.name'] == parking, :]\
    .drop(['properties.name', 'timestamp'], axis = 1)\
    .reset_index(drop = True)

# Save the data
data.to_csv('data/prepared.csv', index = False)

In order to train the model, we have a dedicated script, train.py, which is as follows:

import os
import pandas as pd
import pickle
from skforecast.ForecasterAutoreg import ForecasterAutoreg
from sklearn.ensemble import RandomForestRegressor
import yaml

params = yaml.safe_load(open("params.yaml"))["train"]

# Parameters
seed = params['seed']
steps = params['steps']
lags = params['lags']
output = 'models/model.pickle'

# Make sure the output folder exists
os.makedirs('models', exist_ok=True)

# Read the data and hold out the last `steps` observations for evaluation
data = pd.read_csv('data/prepared.csv')

data_train = data[:-steps]['properties.libres']
data_test  = data[-steps:]['properties.libres']

forecaster = ForecasterAutoreg(
                    regressor = RandomForestRegressor(random_state=seed),
                    lags      = lags
                )

forecaster.fit(y=data_train)


with open(output, "wb") as fd:
    pickle.dump(forecaster, fd)

Finally, we evaluate the performance of our predictions on the last observations of the dataset. We do this with the script evaluate.py:

import json
import pandas as pd
import pickle
import sklearn.metrics as metrics 
import yaml

params = yaml.safe_load(open("params.yaml"))["train"]

metrics_file = 'metrics.json'

# Read the prepared data and the trained forecaster
tmp = pd.read_csv('data/prepared.csv')
predict_steps = params['steps']

with open('models/model.pickle', "rb") as f:
    forecaster = pickle.load(f)

# Compare the last `predict_steps` observed values against the forecast
data_test  = tmp[-predict_steps:]['properties.libres']
predictions = forecaster.predict(steps = predict_steps)

mae = metrics.mean_absolute_error(data_test, predictions)
mse = metrics.mean_squared_error(data_test, predictions)

with open(metrics_file, "w") as f:
    json.dump({'mae': mae, "mse": mse}, f, indent = 2)

In addition, all the parameters I use in the model are in the params.yaml file, which contains the following information:

prepare:
  parking: "Easo"

train:
  seed: 1234
  steps: 36
  lags: 15

So, with this being our data process, let's see how we can create a data pipeline with DVC.

How to create a data pipeline with DVC

There are two ways to create a data pipeline in DVC: use the dvc run command or create a dvc.yaml file.

In my opinion, the easiest way is to learn the main parameters of dvc run; that way, DVC itself will take care of creating the dvc.yaml file.

In this sense, the main parameters of dvc run are the following:

  • -n: the name of the pipeline step.
  • -d: the file(s) on which that pipeline step depends.
  • -o: the file or files that are a direct output of the step.
  • -p: the parameters (from the params.yaml file) on which the step depends.
  • -m / -M: a metrics file generated by this step (-M keeps it out of the DVC cache so it can be versioned directly in Git).
  • -f: forces overwriting an existing stage definition in dvc.yaml.

Although these are the main parameters, in this link you can find the detail of all available parameters.

So, if our data pipeline has several steps, we will have to define each of them with dvc run. In our case, the executed commands are the following:

# Add extract data
dvc run -n extract_data \
        -d src/extract_data.py \
        -o data/raw.csv \
        python src/extract_data.py

# Add prepare
dvc run -n prepare \
        -d data/raw.csv -d src/prepare.py \
        -o data/prepared.csv \
        -p prepare.parking \
        python src/prepare.py

# Train model
dvc run -n train \
        -d data/prepared.csv -d src/train.py \
        -o models/model.pickle \
        -p train.seed,train.steps,train.lags \
        python src/train.py

# Evaluate model
dvc run -n evaluate \
        -d data/prepared.csv -d src/evaluate.py -d models/model.pickle \
        -M metrics.json \
        -p train.steps \
        python src/evaluate.py

When executing these commands, DVC will have created the dvc.yaml file with the following content:

stages:
  extract_data:
    cmd: python src/extract_data.py
    deps:
    - src/extract_data.py
    outs:
    - data/raw.csv
  prepare:
    cmd: python src/prepare.py
    deps:
    - data/raw.csv
    - src/prepare.py
    params:
    - prepare.parking
    outs:
    - data/prepared.csv
  train:
    cmd: python src/train.py
    deps:
    - data/prepared.csv
    - src/train.py
    params:
    - train.lags
    - train.seed
    - train.steps
    outs:
    - models/model.pickle
  evaluate:
    cmd: python src/evaluate.py
    deps:
    - models/model.pickle
    - data/prepared.csv
    - src/evaluate.py
    params:
    - train.steps
    metrics:
    - metrics.json:
        cache: false

Perfect, we already have our Data Pipeline defined. But what is this for? Let’s see.

How to use a Data Pipeline in DVC

The first benefit is that, having defined our data pipeline, we can visualize it. We do this by executing the command dvc dag, which will show us a graph of the steps in the data pipeline, as shown below:

DAG in DVC


Likewise, in addition to letting us visualize the pipeline, DVC offers another very interesting capability: executing and reproducing the complete pipeline. To do this, we simply execute the command dvc repro.

In addition, DVC knows whether or not the files used or generated in the pipeline have changed, since they are registered by DVC. Thanks to this, DVC will only execute a step of our data pipeline if the files or data that feed that step have changed.

That is, if we were to change a value in the params.yaml file to train the model with more lags, when executing dvc repro DVC would not run the first steps, since their inputs would not have changed. Instead, it will execute the train step and, since the evaluate step depends on train, evaluate will run as well.
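After a reproduction, DVC also ships companion commands to inspect what changed; a quick sketch:

dvc metrics show    # print the metrics tracked in metrics.json
dvc metrics diff    # compare metrics against the last committed version
dvc params diff     # see which parameters in params.yaml have changed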

This opens the door to running small experiments in a simpler way. However, the most important thing, in my opinion, is the clarity gained when several people work on the same project, since the DAG can be applied to all types of data pipelines: ETL processes, model training, etc.

Finally, creating the Data Pipelines with DVC opens another great door for us: being able to track experiments from DVC. Let’s see.

How to track experiments in DVC

Requirements to perform experiments with DVC

DVC includes a layer for running simple experiments. For this, the first requirement is that the structure of our repository looks something like this:

├── data/
├── metrics.json
├── models/
├── params.yaml
├── plots/
└── src/

The metrics.json file is where we store the metrics of our model, as I did in the previous script. Also, the params.yaml file contains the parameters we use throughout our data pipeline. This file is mandatory, and you can find an example of how to use it in the previous section.

In addition, as I have indicated previously, in order to track experiments in DVC, we must have our Data Pipeline defined.

Perfect, now that we know what the requirements are to create an experiment in DVC, let's see how to create one.

How to create experiments in DVC

The operation of experiments in DVC is very simple. We just have to run the following command: dvc experiments run --set-param parameter=value, replacing parameter and value with our own.

For you to better understand the process, what DVC really does is the following: since we already have the pipeline defined and its parameters live in a separate file (params.yaml), instead of using the value defined in that file, DVC executes the pipeline with the value we define in the experiment.

In this way, it will run the entire pipeline and save the results obtained, which should end up in the metrics.json file.

In my case, I can run an experiment that makes predictions over a different number of steps. For example:

dvc experiments run --set-param train.steps=400

Once this is done, the pipeline will be executed using that value as the pipeline parameter. If we have launched several experiments, we can compare their results by executing the command dvc exp show, which will return the parameters and the data used in each experiment, as shown below:

Experiments in DVC

As you can see, by default DVC includes information about all the metrics we have obtained, as well as the parameters with which each experiment was run. In this way, we can see, very simply, the different experiments that we, or the rest of the team, have developed, the results obtained and the datasets and files used to reach them, which greatly facilitates the experimentation process.
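If you want to try several values, experiments can also be queued and run in a batch, and the one you like best can be brought back into the workspace. A quick sketch, where the step values are made up and <experiment-name> is a placeholder for the name that dvc exp show reports:

dvc exp run --queue --set-param train.steps=100
dvc exp run --queue --set-param train.steps=200
dvc exp run --run-all
# promote the results of the run you prefer back into the workspace
dvc exp apply <experiment-name>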

Conclusions

In my opinion, DVC is a very good tool for standardizing data tracking. In addition, it is built to work very much like Git, which makes it much easier to use and greatly reduces the learning curve.

Regarding the registration of models and experiments, I personally believe that there are other solutions on the market that are easier and more powerful, such as Neptune and MLFlow. I would recommend Neptune for organizations that do not have ML Engineers and want to track metadata, artifacts, etc. without overcomplicating things.

However, if your organization has ML Engineers, it will likely be more interesting to opt for a tool like MLFlow, which, despite being more complex to implement, includes other functionalities, such as putting models into production.

In any case, DVC is a widely used tool in the world of MLOps, as it greatly simplifies processes that were previously carried out manually. With that, I hope you liked the post. If so, I encourage you to subscribe to stay up to date with new posts. See you in the next one!
