Docker for Data Science
As a data scientist, you have surely created models and code that are very useful. However, if you don’t know how to put them into production, they are not that useful. In this sense, Docker is a key piece of software that will help you put code and applications into production.
That is why, in this post I will show you step by step what Docker is and how it works. Besides, this post combines perfectly with the post where I explain how to put a model into production. If you haven’t read it, I encourage you to do so.
That being said, let’s see how you can use Docker for Data Science!
Introduction to Docker
What is Docker
Suppose you have an application (be it R or Python) on your computer and you want to put it into production. You have developed the application on an Operating System, with a programming language with a specific version and libraries with specific versions as well.
If you want to make sure that your application works on the server, the server will have to have exactly the same setup as your computer. If not, there might be incompatibility issues.
As you will understand, this is unfeasible: a Cloud service cannot have every Operating System, nor every language, nor every version of each. So, what do we do then? This is when Docker kicks in.
Docker is a free software platform that allows you to package and isolate your software. That is, with Docker you can create an element (called Image) that will include everything you need to run your application.
Why use Docker as a Data Scientist
Considering the above, thanks to Docker you will be able to run your application (container) on any system with Docker, regardless of its operating system or whether or not it has the language and libraries your application needs.
This is very interesting especially in two situations:
- Putting ML products into production : If you want to put a model or an application that uses ML in production, Docker will be a very clear way to do it. After all, all Cloud platforms offer the possibility of uploading a Docker image and deploying it in several of its services.
- Creating an isolated environment. Different pieces of software can often cause problems with each other. For example, I have had this problem when running Keras from R on a Mac M1. In those cases an option is to write your code and run it inside a Docker image, in such a way that you make sure it works.
Now you know what Docker is and why it is important to know it if you dedicate yourself to Data Science, but … how does it work? Let’s see how you include your application in Docker!
How to Dockerize an application
To Dockerize an application we will have to follow the following steps:
- Install Docker Desktop on your computer.
- Create a Dockerfile.
- Create the Docker image from the Dockerfile.
- Launch a Container.
As you can see, there are four steps, so let’s see the first one:
How to install Docker Desktop on your computer
Installing Docker on your computer is very simple. If you have Windows you just have to go to this page (link) and download the application. If you use Mac instead, you can download Docker from this other link.
Once you have downloaded and installed the application, you will be able to open it. We are not going to use it just yet, but to make sure it works, you should see something like this (the status indicator at the bottom left should be green).
Also, another option to verify that you have installed Docker correctly is by opening a terminal. In it, execute the following command:

docker --version

If Docker is installed correctly you will see something like this:
Well, you already have Docker installed. Now let’s see what a Dockerfile is and why it is essential in Docker (whether it is for Data Science or not).
What is Dockerfile and how to create it
What is a Dockerfile
A Dockerfile is a file that tells Docker which instructions to follow to include everything necessary in the image: the application, the programs it needs, the installation of the libraries, etc.
At the end of the day, what is behind Docker is a Linux system. Therefore, we will have to install the programs we need, including our code, and tell Docker to run it.
For this, Docker has several instructions that let us specify what elements to include in our image. The most typical instructions are FROM, RUN, COPY, ADD, WORKDIR, EXPOSE, CMD and ENTRYPOINT. Let’s see what each of them does:
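Before going instruction by instruction, here is a minimal skeleton that shows how these instructions fit together in a Dockerfile. It is only an illustrative sketch: the base image, paths and command are placeholders, not the ones we will use later.

```dockerfile
# Illustrative skeleton only: image, paths and command are placeholders.
# Base image to start from:
FROM python:3.10
# Install extra software the base image lacks:
RUN pip install pandas
# Copy our code into the image:
COPY ./app /app
# Set the working directory:
WORKDIR /app
# Port the application will listen on:
EXPOSE 80
# Command to run when the container starts:
CMD ["python", "main.py"]
```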
Create a Dockerfile: FROM
This verb indicates the base image from which we are going to start. As I said before, Docker is based on installing everything we need to execute our code on a Linux system. Does this mean that you are going to have to install Python/R, TensorFlow, etc. from the command line? No, you don’t.
The reason is that the community (or companies) usually create images that already include generic content. For example, if you use Python there are images that already have Python installed. There are even images that already have everything you need to run Tensorflow. And exactly the same thing happens with R: there are images that have R installed, RStudio, Shiny… everything.
To find these images, the most common thing to do is to search for the image in the Docker image repository. If you search for “TensorFlow”, for example, you will find the official Tensorflow image (link). In the case of R, the most common images are maintained by rocker.
For example, suppose we have created some code in Python and we have created an API with FastAPI (in this post I explain how to create APIs in Python). We want to put this API into production, so we have decided to build a Docker.
If we look at the FastAPI documentation, we will see that it points to a Docker image that includes everything required to use FastAPI.

As you can see, the name of the image is tiangolo/uvicorn-gunicorn-fastapi. So, if we want to create a Docker image that includes FastAPI, we will have to start from this image. This is as simple as including the following line in our Dockerfile:
FROM tiangolo/uvicorn-gunicorn-fastapi
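If you need a specific version, images can also be pinned with a tag after a colon. For instance, the FastAPI image publishes per-Python-version tags (the exact tag below is an example; check the image’s page on Docker Hub for the available ones):

```dockerfile
# Pin the base image to a specific tag (illustrative tag):
FROM tiangolo/uvicorn-gunicorn-fastapi:python3.9
```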
Perfect, now we know how we can start from an already defined image. But what if we want to install something else on Docker? Let’s continue with our Docker for Data Science tutorial, looking at the Dockerfile instructions for executing commands!
Create a Dockerfile: RUN

The RUN instruction executes a command in the console. This will help us to install new software that we need but that the image we started from does not include. For example, this can happen if you want to access a database.
RUN can take two different forms:

- RUN command: the shell form, which executes the command in a shell.
- RUN ["executable", "parameter1", "parameter2"]: the exec form, which allows you to pass parameters to an executable.
For example, if we want to connect to PostgreSQL, we will have to install the libpq-dev package. For this, our Dockerfile would have to include the following instructions:
RUN apt-get update
RUN apt-get install -y libpq-dev
Also, in order not to include each line in a different instruction, we could put them together in a single instruction as follows:
RUN apt-get update \
    && apt-get install -y libpq-dev
In our case, we will use the RUN instruction to install the dependencies of our application from the requirements.txt file and also to create a new directory:
RUN pip install -r requirements.txt
RUN mkdir -p app
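For reference, a requirements.txt for this application could look like the following. The exact contents depend on your code; these are just the libraries our example API imports, with versions deliberately left unpinned here:

```
fastapi
pandas
matplotlib
```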
Now that you know how to install the software you need, let’s see how you can include your code in the Docker image!
Dockerfile: COPY, ADD and WORKDIR
COPY and ADD are both used to add content to our Docker image, although ADD offers more options than COPY. While COPY only copies local files, with ADD you can also add files from a URL. Besides, if you add a local compressed file (like a tar archive), ADD will unpack it into the destination path.

Although ADD may seem far superior to COPY, unless you need some of this extra ADD functionality, Docker recommends using COPY.
Both COPY and ADD take two parameters:
- Source path: indicates the path where the file you want to copy is located.
- Destination path: the path where you want to “paste” the file. If you don’t want to paste it into a specific path, you can specify ./ so that the element is pasted into the Working Directory.
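As a sketch, this is what those two parameters look like in practice (the paths and URL below are placeholders, not the ones from our project):

```dockerfile
# COPY <source path> <destination path>
# Copy one file into the Working Directory:
COPY requirements.txt ./
# Copy a whole folder to an absolute path:
COPY ./app /code/app
# ADD can also fetch a file from a URL:
ADD https://example.com/data.csv /data/data.csv
```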
Likewise, to indicate the Working Directory we will use the WORKDIR instruction. WORKDIR sets the directory that the COPY and ADD instructions that follow it are resolved against. Also, if necessary, it is possible to change the WORKDIR several times in the same Dockerfile (although this is not very common, at least when we use Docker for Data Science).
Following the example, suppose we have the following structure:
.
├── app
│   └── main.py
├── Dockerfile
└── requirements.txt
The main.py file includes the same code that I used in the post on how to create APIs in Python:
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/get-iris")
def get_iris():
    import pandas as pd
    url = 'https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/0e7a9b0a5d22642a06d3d5b9bcbad9890c8ee534/iris.csv'
    iris = pd.read_csv(url)
    return iris

@app.get("/plot-iris")
def plot_iris():
    import pandas as pd
    import matplotlib.pyplot as plt
    url = 'https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/0e7a9b0a5d22642a06d3d5b9bcbad9890c8ee534/iris.csv'
    iris = pd.read_csv(url)
    plt.scatter(iris['sepal_length'], iris['sepal_width'])
    plt.savefig('iris.png')
    file = open('iris.png', mode="rb")
    return StreamingResponse(file, media_type="image/png")
So, we will have to include our file in our Docker image. To do this, we would simply include the following instruction:
# Create the folder
RUN mkdir -p app

# Paste the folder
COPY ./app app
Let’s see the last point: how to run our application.
Dockerfile: EXPOSE, ENTRYPOINT and CMD
If your code will be listening on a port, EXPOSE allows you to declare the port on which your application will be listening.

On the other hand, ENTRYPOINT allows you to define the command to be executed when the Docker image is launched, while CMD allows you to define the default arguments that ENTRYPOINT will use.
For example, suppose we are going to run our FastAPI API, and we want the user to be able to define the host and port on which it runs. That is why we will use the CMD instruction. More specifically, the instruction will be the following:
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8080"]
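To see how ENTRYPOINT and CMD combine, consider this hypothetical pair of instructions (not the ones in our final Dockerfile). When the container starts, Docker concatenates them; any arguments passed to docker run replace the whole CMD, which is what makes the host and port overridable:

```dockerfile
# Hypothetical combination, for illustration only:
ENTRYPOINT ["uvicorn", "app.main:app"]
CMD ["--host", "0.0.0.0", "--port", "8080"]
# docker run image           -> uvicorn app.main:app --host 0.0.0.0 --port 8080
# docker run image --port 80 -> uvicorn app.main:app --port 80
```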
Finishing the Dockerfile
Considering all the above, our Dockerfile will look like this:
FROM tiangolo/uvicorn-gunicorn-fastapi
COPY requirements.txt .
RUN pip install -r requirements.txt
RUN mkdir -p app
COPY ./app app
EXPOSE 80
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "80"]
Now, we will have to save it as a file with no extension. For this, you will need a text editor like Notepad++ or Sublime Text. Also, the file must be called Dockerfile (otherwise, the docker build command will not find it by default).
With this, we have our Dockerfile ready and we can create our Docker image, let’s see how.
How to create a Docker image from a Dockerfile
Once we have our Dockerfile, we must build our Docker image. To do so, first open a terminal in the folder that contains the Dockerfile.
Once the terminal is open in the folder, we must execute the following command:
docker build .
Important: do not forget the final dot. It tells Docker to use the current directory as the build context; without it, the command will not work.
In addition, we can set a lot of parameters when building the image, although the most used are usually:
-t: allows you to give a name and a tag to the image. The name will allow us to differentiate the images, and the tag will allow us to differentiate the versions. This is a parameter that I personally always include. In fact, in this example I have executed:
docker build -t app_prueba .
--no-cache: if you repeat the build of an image, it prevents Docker from reusing the cached layers.
-m: lets you limit the memory that the build process will use. On some operating systems, Docker itself limits memory usage to 2GB, so with this we can allow it to use more.
Once you launch the command and it runs correctly, you will see your image in Docker Desktop:
If you’ve made it this far, congratulations, you’ve built your first Docker image.
Now, let’s see how to launch the application.
How to launch a Docker image
Launching a Docker image locally is super easy. Once you have the Docker image, hover the mouse over it and the “Run” button will appear to its right. By clicking on it, a window like the following one will appear:
In this window, we can set several options, such as the name we want this container to have or the port we want it to run on. If you do not specify anything, it will use the default values and a random name will be generated.
On the other hand, another way to launch the application (the one that I personally use) is through the console. To do this you simply have to indicate the:

- Name you want to give the container.
- Port on the host that you will listen on.
- Port the container itself is listening on.
So, considering the Dockerfile that exposes the app on port 80, I’m going to ask it to listen on port 80 of the localhost. I do this by launching the container with the following command:
docker run -dp 80:80 --name example_app app:001
Personally, I always recommend indicating at least the name, since it makes it easier for us to understand what each thing is in case we have several containers.
So, if everything works we will see that the application has been launched correctly:
We already have it! Finally, it only remains to access this application, which we can do from our computer (in my case on port 80):
Without a doubt, Docker is a fundamental tool for any Data Scientist, since it takes us very close to putting applications and models into production (something I will talk about later, building on this post).
So, I encourage you to practice with Docker and, if you are interested in putting models and applications into production in Cloud environments, I recommend you subscribe so you get an alert when I write about it. See you in the next post!