How to Put a Python Model into Production

Create a Machine Learning model with Sklearn

Reading and cleaning data

First of all, we need some data on which to make predictions. For this, I have used this dataset, which includes information on house prices in Madrid.

The objective of the project is therefore to create a model that, with just 4 or 5 variables, allows us to predict the price of a house in Madrid. That way, once we have the model in production, we can create a form with these fields so that people can easily estimate the price of their home in Madrid.

Note: in this post I am not going to focus on the options that Sklearn offers for creating models. If you want to learn more about Sklearn and everything it offers, I recommend that you read this post.

So, first of all we read the data:

import pandas as pd

url = 'https://raw.githubusercontent.com/anderfernandez/datasets/main/Casas%20Madrid/houses_Madrid.csv'
data = pd.read_csv(url)

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21742 entries, 0 to 21741
Data columns (total 58 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Unnamed: 0                    21742 non-null  int64  
 1   id                            21742 non-null  int64  
 2   title                         21742 non-null  object 
 3   subtitle                      21742 non-null  object 
 4   sq_mt_built                   21616 non-null  float64
 5   sq_mt_useful                  8228 non-null   float64
 6   n_rooms                       21742 non-null  int64  
 7   n_bathrooms                   21726 non-null  float64
 8   n_floors                      1437 non-null   float64
 9   sq_mt_allotment               1432 non-null   float64
 10  latitude                      0 non-null      float64
 11  longitude                     0 non-null      float64
 12  raw_address                   16277 non-null  object 
 13  is_exact_address_hidden       21742 non-null  bool   
 14  street_name                   15837 non-null  object 
 15  street_number                 6300 non-null   object 
 16  portal                        0 non-null      float64
 17  floor                         19135 non-null  object 
 18  is_floor_under                20572 non-null  object 
 19  door                          0 non-null      float64
 20  neighborhood_id               21742 non-null  object 
 21  operation                     21742 non-null  object 
 22  rent_price                    21742 non-null  int64  
 23  rent_price_by_area            0 non-null      float64
 24  is_rent_price_known           21742 non-null  bool   
 25  buy_price                     21742 non-null  int64  
 26  buy_price_by_area             21742 non-null  int64  
 27  is_buy_price_known            21742 non-null  bool   
 28  house_type_id                 21351 non-null  object 
 29  is_renewal_needed             21742 non-null  bool   
 30  is_new_development            20750 non-null  object 
 31  built_year                    10000 non-null  float64
 32  has_central_heating           13608 non-null  object 
 33  has_individual_heating        13608 non-null  object 
 34  are_pets_allowed              0 non-null      float64
 35  has_ac                        11211 non-null  object 
 36  has_fitted_wardrobes          13399 non-null  object 
 37  has_lift                      19356 non-null  object 
 38  is_exterior                   18699 non-null  object 
 39  has_garden                    1556 non-null   object 
 40  has_pool                      5171 non-null   object 
 41  has_terrace                   9548 non-null   object 
 42  has_balcony                   3321 non-null   object 
 43  has_storage_room              7698 non-null   object 
 44  is_furnished                  0 non-null      float64
 45  is_kitchen_equipped           0 non-null      float64
 46  is_accessible                 4074 non-null   object 
 47  has_green_zones               4057 non-null   object 
 48  energy_certificate            21742 non-null  object 
 49  has_parking                   21742 non-null  bool   
 50  has_private_parking           0 non-null      float64
 51  has_public_parking            0 non-null      float64
 52  is_parking_included_in_price  7719 non-null   object 
 53  parking_price                 7719 non-null   float64
 54  is_orientation_north          11358 non-null  object 
 55  is_orientation_west           11358 non-null  object 
 56  is_orientation_south          11358 non-null  object 
 57  is_orientation_east           11358 non-null  object 
dtypes: bool(5), float64(17), int64(6), object(30)
memory usage: 8.9+ MB

In datasets like this, there may be many text columns that contain only a single value. So let's check whether that happens here:

import numpy as np

str_cols = data.select_dtypes(['object']).columns
str_unique_vals = data[str_cols]\
    .apply(lambda x: len(x.dropna().unique()))

str_unique_vals
title                           10736
subtitle                          146
raw_address                      9666
street_name                      6177
street_number                     420
floor                              19
is_floor_under                      2
neighborhood_id                   126
operation                           1
house_type_id                       4
is_new_development                  2
has_central_heating                 2
has_individual_heating              2
has_ac                              1
has_fitted_wardrobes                1
has_lift                            2
is_exterior                         2
has_garden                          1
has_pool                            1
has_terrace                         1
has_balcony                         1
has_storage_room                    1
is_accessible                       1
has_green_zones                     1
energy_certificate                 10
is_parking_included_in_price        2
is_orientation_north                2
is_orientation_west                 2
is_orientation_south                2
is_orientation_east                 2
dtype: int64

As you can see, there are several columns (has_green_zones, has_ac, has_garden, etc.) with only a single distinct value. Let's check a couple of them to see what is going on in those columns:

print(data['has_garden'].unique())
print(data['has_pool'].unique())
[nan True]
[nan True]

As you can see, in both cases the only values are True and NaN. It seems that here NaN actually means False, so I change it:

str_unique_vals_cols = str_unique_vals[str_unique_vals == 1].index.tolist()

data.loc[:,str_unique_vals_cols] = data\
  .loc[:,str_unique_vals_cols].fillna(False)

Once this is done, we continue with the data cleaning. Specifically, I am going to do two things:

Eliminate variables with a high percentage of missing values.

Eliminate variables that are not useful for prediction, such as the street name.

Note: if we wanted to create the best possible model, perhaps we would skip this second step, since the street name could give us relevant information about the area in which the house is located. However, the goal is not to create the best possible model, but to learn how to put a Python model into production.

# Remove variables with a high share of NAs
ind_keep = data.isna().sum() < 0.3 * data.shape[0]
data = data.loc[:,ind_keep]

# Remove columns
data.drop([
  'title', 'street_name','raw_address',
  'is_exact_address_hidden','is_rent_price_known',
  'is_buy_price_known', 'subtitle',
  'floor','buy_price_by_area', 'rent_price', 'id', 'Unnamed: 0'
  ], axis = 1, inplace = True)

I also plot the data to see if anything stands out:

import matplotlib.pyplot as plt
import seaborn as sns

# Set the figure size
plt.rcParams['figure.figsize'] = [12, 8]
plt.rcParams['figure.dpi'] = 100

str_cols = data.select_dtypes('object').columns.tolist()
num_cols = data.select_dtypes(['int', 'float']).columns.tolist()

# Correlation matrix of the numeric variables
cor_matrix = data[num_cols].corr()
sns.heatmap(cor_matrix)
plt.show()
[Figure: heatmap of correlations between the numeric variables]

As we can see, of all the numerical variables, the most correlated with the house price are: the number of square meters built (sq_mt_built), the number of bathrooms (n_bathrooms) and the number of rooms (n_rooms).

Since the objective is to use only a few predictor variables (4 or 5), for the moment I will keep just these three numerical variables.

Now let’s see how categorical variables behave:

import math
str_cols = data.select_dtypes('object').columns

fig, ax = plt.subplots( math.ceil(len(str_cols)/3), 3, figsize=(15, 15))

for var, subplot in zip(str_cols, ax.flatten()):
    sns.boxplot(x=var, y='buy_price', data=data, ax=subplot)

plt.show()
[Figure: boxplots of buy_price for each categorical variable]

As we can see, there are several problems in the dataset. However, right now that is not what interests us; instead, we want to see which variables can help us the most when predicting the price of a home.

Visually, we can see that the type of house (house_type_id) shows quite marked differences, as does whether or not the home has a lift (has_lift).

So, we are going to use these 5 variables in order to make the house price prediction. Let’s get to it.

Creation of the House Price Prediction Model

The first step in building the model is to split the data into train and test sets. For this I will use sklearn.

Note: if you do not know Sklearn in depth, or I use functions you are not familiar with, I recommend reading my Sklearn tutorial, where I explain it in depth.

So, let’s keep building the model in Python so we can put it into production.

from sklearn.model_selection import train_test_split

keep_cols = ['sq_mt_built', 'n_bathrooms', 'n_rooms' , 'has_lift', 'house_type_id']

# Split the data
y = data['buy_price']
x = data[keep_cols]
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state = 1234)

print(x_train.shape, y_train.shape)
(16306, 5) (16306,)

Now that we have the data loaded, I am going to analyze the completeness of the data and the outliers.

x_train.isna().sum()
sq_mt_built        98
n_bathrooms        10
n_rooms             0
has_lift         1828
house_type_id     286
dtype: int64

As we can see, has_lift has many missing values. This may affect only a specific type of house, so let's analyze it:

x_train\
  .assign(
      n_nas = x_train['has_lift'].isnull(),
      n_rows = 1
      )\
  .groupby('house_type_id')\
  .sum()\
  .reset_index()\
  .loc[:,['house_type_id', 'n_nas', 'n_rows']]
                house_type_id  n_nas  n_rows
0          HouseType 1: Pisos    312   13260
1  HouseType 2: Casa o chalet   1496    1496
2         HouseType 4: Dúplex      7     502
3         HouseType 5: Áticos      5     762

As we can see, 100% of the Chalets have an empty elevator field, while in the rest of the building types there are hardly any null fields.

Chalets do not usually have lifts, so we will set these missing values to False.

# Apply the fix in both train and test
x_train.loc[
            x_train['house_type_id'] == 'HouseType 2: Casa o chalet', 'has_lift'
            ] = False

x_test.loc[
            x_test['house_type_id'] == 'HouseType 2: Casa o chalet', 'has_lift'
            ] = False
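A side note: depending on your pandas version, these .loc assignments on the output of train_test_split can raise a SettingWithCopyWarning. A simple way to avoid it is to take explicit copies right after the split:

# Right after train_test_split, before any assignment:
x_train = x_train.copy()
x_test = x_test.copy()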

If we recheck the data, we will see that we now have far fewer missing values:

x_train.isna().sum()
sq_mt_built       98
n_bathrooms       10
n_rooms            0
has_lift         332
house_type_id    286
dtype: int64

In any case, I first have to impute the NAs in order to make predictions. To do this, I will simply impute the mode (it may not be the best strategy, but as I said, the goal is not to get the best possible prediction).

Strictly speaking, I should not need the modes dictionary when making predictions, since the form itself will validate the data and, therefore, there should be no missing values. Even so, I will save it along with the other objects, just in case.

# Impute NAs with the mode
import pickle

# Compute the modes
modes = dict(zip(x_train.columns, x_train.mode().loc[0,:].tolist()))

# Impute the mode
for column in x_train.columns:
  x_train.loc[x_train[column].isna(),column] = modes.get(column)

Now that the data is imputed, we can create the model. I will start with a Random Forest, since it usually gives good results.

from sklearn.metrics import mean_absolute_error  
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelBinarizer

# Define the encoder
encoder = LabelBinarizer()
encoder_fit = encoder.fit(x_train['house_type_id'])

encoded_data_train = pd.DataFrame(
  encoder_fit.transform(x_train['house_type_id']),
  columns = encoder_fit.classes_.tolist()
) 

# Add encoded variables
x_train_transf = pd.concat(
  [x_train.reset_index(), encoded_data_train],
  axis = 1
  )\
  .drop(['index', 'house_type_id'], axis = 1)

# Create model
rf_reg = RandomForestRegressor()
rf_reg_fit = rf_reg\
  .fit(x_train_transf, y_train)

preds = rf_reg_fit.predict(x_train_transf)

mean_absolute_error(y_train, preds)
48713.740368575985

As we can see, we have a mean absolute error of about €49,000 on the train set. Let's see how large the error is on the test set:

# Impute the mode in the test set
for column in x_test.columns:
  x_test.loc[x_test[column].isna(),column] = modes.get(column)

# One hot encoding
encoded_data_test = pd.DataFrame(
  encoder_fit.transform(x_test['house_type_id']),
  columns = encoder_fit.classes_.tolist()
) 

x_test_transf = pd.concat(
  [x_test.reset_index(), encoded_data_test],
  axis = 1
  )\
  .drop(['index','house_type_id'], axis = 1)

preds = rf_reg_fit.predict(x_test_transf)

mean_absolute_error(y_test, preds)
131327.28633922

As we can see, the error on the test set is considerably larger than on train. If we wanted to create a truly useful tool, we would have to keep improving the model. In our case, however, it serves as a proof of concept.
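Although model improvement is out of scope for this post, a natural next step, given the gap between train and test error, would be to tune the forest's hyperparameters with cross-validation. A minimal sketch with Sklearn's GridSearchCV (the grid values below are illustrative assumptions, not tuned choices):

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# Illustrative grid; the values are assumptions
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [10, 20, None],
    'min_samples_leaf': [1, 5, 10],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=1234),
    param_grid,
    scoring='neg_mean_absolute_error',
    cv=3,
)
search.fit(x_train_transf, y_train)

print(search.best_params_, -search.best_score_)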

Perfect, we now have the model. But how do we put it into production? First of all, we create an API using FastAPI.

Create an API that makes predictions with FastAPI

To put our application into production, we simply have to create an API with FastAPI that receives the model inputs as parameters and returns a prediction.

Note: if you don't know how FastAPI or other tools for creating APIs in Python work, I recommend that you read this post where I explain it.

So, before creating the API, we first have to save all the necessary objects. More specifically: the modes dictionary, the LabelBinarizer encoder and the model.

with open('app/modes.pickle', 'wb') as handle:
    pickle.dump(modes, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open('app/encoder.pickle', 'wb') as handle:
    pickle.dump(encoder_fit, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open('app/model.pickle', 'wb') as handle:
    pickle.dump(rf_reg_fit, handle, protocol=pickle.HIGHEST_PROTOCOL)

Now that we have the files saved, we simply have to create the API with FastAPI in a file called main.py. The file looks like this:


from fastapi import FastAPI
import pandas as pd

app = FastAPI()

# Load the serialized objects once at startup, instead of on every request
encoder_fit = pd.read_pickle("app/encoder.pickle")
rf_reg_fit = pd.read_pickle("app/model.pickle")


@app.post("/make_preds")
def make_preds(sq_mt: int, n_bathrooms: int,
               n_rooms: int, has_lift: bool, house_type: str):

  # Declaring has_lift as bool lets FastAPI parse 'True'/'False' correctly;
  # calling bool() on the raw query string would be True for any non-empty value

  # Create a one-row df with the same columns used in training
  x_pred = pd.DataFrame(
    [[sq_mt, n_bathrooms, n_rooms, has_lift, house_type]],
    columns = ['sq_mt_built', 'n_bathrooms', 'n_rooms',
               'has_lift', 'house_type_id']
    )

  # One hot encoding of the house type
  encoded_data_pred = pd.DataFrame(
    encoder_fit.transform(x_pred['house_type_id']),
    columns = encoder_fit.classes_.tolist()
  )

  # Build the final df, with the same layout as in training
  x_pred_transf = pd.concat(
    [x_pred.reset_index(), encoded_data_pred],
    axis = 1
  )\
  .drop(['house_type_id', 'index'], axis = 1)

  preds = rf_reg_fit.predict(x_pred_transf)

  return round(preds[0])

Now we can launch our API and test that it works. To do this, we need to have the uvicorn module installed, and then we run the following command:

uvicorn main:app --reload

Finally, we can check that our API returns the prediction value. To do this, we make a POST request to it (while it is running locally):

import requests

sq_met = 100
n_bathrooms = 2
n_rooms = 2
has_lift = True
house_type = 'HouseType 1: Pisos'  


url = f'http://127.0.0.1:8000/make_preds?sq_mt={sq_met}&n_bathrooms={n_bathrooms}&n_rooms={n_rooms}&has_lift={has_lift}&house_type={house_type}'
url = url.replace(' ', '%20')  # URL-encode the spaces in the house type

resp = requests.post(url)
resp.content
b'488557'

As we can see, we have been able to execute our model within the API and it works correctly.

Note: when serving predictions, it is usually a good idea to validate the incoming data types and to insert each request into a table, so that the predicted data is stored. In our case, since the input comes from a form, the form itself would handle both the validation and the insert.
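For illustration, a minimal sketch of what that validation and logging could look like inside the API (the table name and schema are hypothetical; with FastAPI, a pydantic model makes the framework reject requests whose types don't match):

import sqlite3
from pydantic import BaseModel

# Pydantic model: FastAPI validates types against this schema
class HouseFeatures(BaseModel):
    sq_mt: int
    n_bathrooms: int
    n_rooms: int
    has_lift: bool
    house_type: str

def log_prediction(features: HouseFeatures, prediction: float):
    # Store each request and its prediction (hypothetical local table)
    con = sqlite3.connect('predictions.db')
    con.execute(
        'CREATE TABLE IF NOT EXISTS predictions '
        '(sq_mt INT, n_bathrooms INT, n_rooms INT, '
        'has_lift INT, house_type TEXT, prediction REAL)'
    )
    con.execute(
        'INSERT INTO predictions VALUES (?, ?, ?, ?, ?, ?)',
        (features.sq_mt, features.n_bathrooms, features.n_rooms,
         int(features.has_lift), features.house_type, prediction),
    )
    con.commit()
    con.close()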

So, we have already created an API that allows us to run our model. Now let’s see how we can put our Python model into production.

How to put a Python model into production

Create a Docker with the model

Although there are many ways to put a model into production, the most common is to create a Docker image. If you don't know it, Docker is a tool that allows you to create isolated, self-contained and portable environments, so that you can run your code on any platform with Docker, abstracting away operating systems, language versions, packages, etc.

Note: in this post I assume some background in Docker. If you don't know Docker, you can learn about it in this post.

So, first of all, we are going to create a Dockerfile that installs everything necessary to run our API. Keep in mind that the folder containing the app has the following files:

│   Dockerfile
│
│   requirements.txt
│   
└───app
        encoder.pickle
        main.py
        model.pickle
        modes.pickle

Taking into account that the folder is like this, my Dockerfile is the following:

FROM tiangolo/uvicorn-gunicorn-fastapi
COPY requirements.txt .
RUN pip install -r requirements.txt
RUN mkdir -p app
COPY ./app app
EXPOSE 8080
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8080"]

Finally, we have to build the image, which we can do with the following command:

docker build -t modelo_produccion_python .
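Before uploading it anywhere, we can check that the container works locally by running it and making the same request as before (a quick smoke test; the port mapping matches the EXPOSE in the Dockerfile):

docker run -p 8080:8080 modelo_produccion_python
# In another terminal:
curl -X POST "http://localhost:8080/make_preds?sq_mt=100&n_bathrooms=2&n_rooms=2&has_lift=True&house_type=HouseType%201:%20Pisos"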

Now that we have the Docker image built, we have to upload it to our favorite cloud provider. In my case, I will put the Python model into production on Google Cloud using Cloud Run.

Upload model to a Cloud environment

Now that we have our model in a Docker image, we can upload it to our favorite cloud service. In my case, I will upload it to Cloud Run, so that the model scales from zero (you don't pay when it is not being used) up to whatever is needed.

So, I’ll use the Google Cloud SDK to upload the Docker image to the Container Registry and then deploy to Cloud Run.

Note: for this explanation, I assume you have a Google Cloud account with a billing account assigned. If you don't, you can learn how to set one up here.

For this, we have to:

  1. Install the Google Cloud SDK, which you can download from this link.
  2. Link your computer with your Google Cloud account. To do this, simply run the following command:
gcloud auth login
  3. Link Docker with Google Cloud, so that you can push images from Docker to the Container Registry. To do this, simply run the following command:
gcloud auth configure-docker
  4. Tag your image so it can be uploaded. To tag the image, you need to know the image name and the project id. In my case, the image name is modelo_produccion_python and the project id is direct-analog-185510. So I have to run the following command:
docker tag modelo_produccion_python gcr.io/direct-analog-185510/modelo_produccion_python

# docker tag <image-name> gcr.io/<project-id>/<image-name>
  5. Upload the image to the Container Registry. To do this, simply push the image you just tagged. In my case:
docker push gcr.io/direct-analog-185510/modelo_produccion_python

Once you have the image in the Container Registry, you can choose from there the service on which you want to deploy it.

By following the steps, you will end up publishing your image on Cloud Run.
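Alternatively, if you prefer to script the deployment instead of using the console, the same can be done with the gcloud CLI (the service name and region below are assumptions, adjust them to your project):

gcloud run deploy modelo-produccion-python \
  --image gcr.io/direct-analog-185510/modelo_produccion_python \
  --platform managed \
  --region europe-west1 \
  --allow-unauthenticated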

Important: if you publish the image on Cloud Run, it is important to decide whether or not requests must be authenticated. If you require authentication, you must take it into account when making predictions.
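For illustration, if you do require authentication, one way to call the service from a machine with an authenticated gcloud session is to attach an identity token to the request (a sketch; it assumes the gcloud CLI is installed and logged in):

import subprocess
import requests

# Obtain an identity token from the local gcloud session
token = subprocess.run(
    ['gcloud', 'auth', 'print-identity-token'],
    capture_output=True, text=True
).stdout.strip()

# 'url' is the Cloud Run endpoint, built as in the snippet below
resp = requests.post(url, headers={'Authorization': f'Bearer {token}'})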

Finally, we end up with a URL. We can test that our algorithm works by making requests to it:

sq_met = 100
n_bathrooms = 2
n_rooms = 2
has_lift = True
house_type = 'HouseType 1: Pisos'  


url = f'https://modelo-produccion-python-rk6gh2l6da-ew.a.run.app/make_preds?sq_mt={sq_met}&n_bathrooms={n_bathrooms}&n_rooms={n_rooms}&has_lift={has_lift}&house_type={house_type}'
url = url.replace(' ', '%20')  # URL-encode the spaces in the house type

resp = requests.post(url)

resp.content
b'488557'

As you can see, we now have our model working in production. All that remains is to hook up our form so that it makes requests to this endpoint.

Conclusion

As you can see, creating models in Python is very powerful, but knowing how to put a Python model into production is what makes the difference. With this knowledge you will not only be able to put your own models into production, but you will also understand DevOps better (in case other people in your organization handle it) and even create machine-learning-based applications in a much simpler way.

Without a doubt, this is an example with a simple model. From here, everything can get much more complicated: continuous deployment with Git and CI/CD, several models in production behind several endpoints, more complex and scalable platforms and technologies such as Kubernetes, etc.

In any case, the basic idea is generally always the same: create the model, create an API to expose the model, create a Docker and put it into production.

I hope this post has been useful to you. If so, I encourage you to subscribe to the newsletter to keep up to date with new posts. And, as always, see you in the next one!