Recommendation System with Python

In this post you will learn how to program a recommendation system with Python.

To do this, we will first see what types of recommendation systems exist, along with their advantages and disadvantages. Then, we will program each of them individually from scratch. Finally, we’ll look at tools you can use to build recommendation systems with Python.

Sound interesting to you? Let’s go with it!

Types of recommender system

A recommendation system is based on offering content to users. In this sense, there are two main types of recommender systems:

  1. Non-personalized recommendation systems: systems that recommend the same content to everyone. For example, when a page recommends its most popular posts, that recommendation is the same for every visitor.
  2. Personalized recommendation systems: in this case, each user is recommended different content, with the aim of matching the content offered to that user’s tastes. In these systems, two people rarely receive the same recommendation.
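
To make the non-personalized case concrete, here is a minimal sketch of a popularity-based recommender. The view log and the helper name most_popular are made up for illustration; the point is only that the result does not depend on who is asking.

```python
from collections import Counter

# Toy interaction log (made-up data): each tuple is (user, item viewed).
views = [
    ("ana", "post_a"), ("ben", "post_a"), ("cat", "post_a"),
    ("ana", "post_b"), ("ben", "post_b"),
    ("cat", "post_c"),
]

def most_popular(views, n=2):
    """Return the n most viewed items; the result ignores who is asking."""
    counts = Counter(item for _, item in views)
    return [item for item, _ in counts.most_common(n)]

top_posts = most_popular(views)
print(top_posts)  # ['post_a', 'post_b']
```

Every user would receive the same list, which is exactly what separates this approach from the personalized systems covered in the rest of the post.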

As this is a blog about Machine Learning, we will focus only on personalized recommender systems. But what does a recommendation system rely on to personalize content? Let’s find out.

How does a personalized recommendation system work?

Within personalized recommendation systems there are three main approaches: content-based, user-based, and hybrid. Let’s see what each of them consists of.

Content based recommendation system

As its name indicates, a content-based recommendation system uses the characteristics of the content to make recommendations. In other words, a content-based recommendation system will recommend content that is similar to content that the user has liked.

Let’s say I really like action movies and lately I’ve binge-watched the John Wick saga. A content-based recommendation system would then recommend action movies, movies starring Keanu Reeves, movies directed by Chad Stahelski (director of John Wick), movies from the same years, and so on.

In short, the key to content-based recommendation systems is to define the characteristics of the content that will be used to find similar products. In the example above: main actor, director, film category, year, etc.
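
As a rough illustration of this idea, the sketch below describes each film by a set of features and ranks the other films by how much they overlap with a liked one. The feature sets and the helper name cosine_similarity are chosen for illustration; this is not the method used later in the post, just one simple way to compare binary features.

```python
import math

# Hypothetical feature sets; the values are illustrative, not taken from the dataset.
films = {
    "John Wick": {"genre:action", "star:Keanu Reeves", "director:Chad Stahelski"},
    "John Wick: Chapter 2": {"genre:action", "star:Keanu Reeves", "director:Chad Stahelski"},
    "The Matrix": {"genre:action", "star:Keanu Reeves", "director:Wachowski"},
    "Inside Out": {"genre:animation", "star:Amy Poehler", "director:Pete Docter"},
}

def cosine_similarity(a, b):
    """Cosine similarity between two binary feature sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

liked = "John Wick"
scores = {
    title: cosine_similarity(films[liked], feats)
    for title, feats in films.items() if title != liked
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['John Wick: Chapter 2', 'The Matrix', 'Inside Out']
```

The more features two films share, the higher they rank, so the choice of features directly decides what "similar" means.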

Pros and cons of the content-based recommendation system

Pros of the content-based recommendation system

  • The system allows predictions to be made from the moment a user consumes a product. That is, the system does not suffer from the problem known as cold start.
  • It is an adaptive system: if users change their tastes, the system will notice these changes quickly.
  • Unpopular products can be recommended.
  • It is a system that is easy to scale.

Cons of the content-based recommendation system

  • The recommended content is always similar. In other words, if a person consumes action movies, it is difficult to recommend products from another category.

User based recommendation system

In the user-based recommendation system, the idea is to find users with similar tastes to yours and recommend content that those users have liked, but that you have not consumed.

For example, suppose my friend Mike and I have similar taste in movies and we both like John Wick. Mike has also seen The Equalizer and Inside Out and liked both, whereas I haven’t seen either. These films would be candidates to be recommended to me.

In summary, the key to a user-based recommendation system is to be able to accurately profile the tastes of users in order to make the most accurate recommendations possible.

To achieve this goal, there are two different approaches:

  1. Memory-based systems: these consist of building a User-Content matrix and searching it for similar users. Although intuitive, this approach does not scale well: with millions of users and thousands of content items, the resulting matrix is too large.
  2. Model-based recommendation systems: These systems seek to overcome the limitations of memory-based systems, by using ML models such as neural networks, Bayesian networks, clustering models, etc.
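
The memory-based approach can be sketched in a few lines. The ratings below are made up, and the helper names cosine and recommend are hypothetical; the sketch simply finds the most similar user and suggests what that user rated and I have not seen.

```python
import math

# Made-up ratings on a 1-5 scale: user -> {item: rating}.
ratings = {
    "me":   {"John Wick": 5, "Taken": 4},
    "mike": {"John Wick": 5, "Taken": 4, "The Equalizer": 5, "Inside Out": 4},
    "zoe":  {"John Wick": 1, "Taken": 2, "Notting Hill": 5},
}

def cosine(u, v):
    """Cosine similarity between two users' rating vectors (missing = 0)."""
    dot = sum(r * v[item] for item, r in u.items() if item in v)
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def recommend(user, ratings):
    """Recommend items the most similar user rated but `user` has not seen."""
    others = [name for name in ratings if name != user]
    nearest = max(others, key=lambda name: cosine(ratings[user], ratings[name]))
    unseen = {i: r for i, r in ratings[nearest].items() if i not in ratings[user]}
    return nearest, sorted(unseen, key=unseen.get, reverse=True)

nearest, recs = recommend("me", ratings)
print(nearest, recs)  # mike ['The Equalizer', 'Inside Out']
```

Note that every recommendation requires comparing the user against all other users, which is where the scalability problem mentioned above comes from.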

Pros and cons of the user-based recommendation system

Pros of user-based recommendation system

  • You can offer content that is different from what the user generally consumes. Going back to the example, the user-based recommendation system is able to recommend Inside Out (an animated movie), even though it is very different from what I usually watch (action movies).

Cons of the user-based recommendation system

  • It is difficult to recommend new content. Since few (or no) people will have consumed it, a user-based recommendation system has little information with which to recommend it.
  • They suffer from the cold start problem. A user needs to have consumed a lot of content before the system can profile them correctly and make recommendations.

Hybrid recommendation system

A hybrid recommendation system is based on combining the predictions made by a content-based recommendation system and a user-based recommendation system.

In this way, the hybrid recommendation system has the strengths of both the content-based recommendation system and the user-based recommendation system.
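
One simple way to combine the two systems is a weighted average of their scores. The sketch below assumes both recommenders return scores on the same 0 to 1 scale; the function name hybrid_scores, the weight and the scores themselves are made up for illustration.

```python
def hybrid_scores(content_scores, user_scores, weight=0.5):
    """Blend two score dictionaries; `weight` is the share given to content."""
    items = set(content_scores) | set(user_scores)
    return {
        item: weight * content_scores.get(item, 0.0)
              + (1 - weight) * user_scores.get(item, 0.0)
        for item in items
    }

# Made-up scores in [0, 1] from the two hypothetical recommenders.
content_scores = {"The Matrix": 0.9, "Inside Out": 0.1}
user_scores = {"The Matrix": 0.4, "Inside Out": 0.8, "The Equalizer": 0.7}

blended = hybrid_scores(content_scores, user_scores, weight=0.5)
ranking = sorted(blended, key=blended.get, reverse=True)
print(ranking)  # ['The Matrix', 'Inside Out', 'The Equalizer']
```

Other combinations are possible (for example, using one system only when the other has too little data), so this is just the most direct option.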

Now that we know what each of the different types of recommender systems are and how they work, we are going to program each of them from scratch in Python. Let’s go with it!

How to program a content-based recommendation system in Python

1. Data Download

First of all, to program a recommender system you need a dataset. We are going to use the IMDB dataset, which contains information on 1,000 movies and series rated on IMDB. You can download the dataset from here.

So first of all we are going to load the dataset:

import pandas as pd

imdb = pd.read_csv('imdb_top_1000.csv')

As we can see, we have several variables (Genre, Director, Stars, etc.) that allow us to characterize the movie.

Now that we know what the data is like, let’s prepare it.

2. Data preparation

In order to train the model, I consider that there are two main approaches:

  1. Make the system based on the description of the film, in such a way that films with similar descriptions are recommended.
  2. Carry out the recommendation system based on the characteristics of the film (director, actors, genre, etc.).

Without a doubt, in a real case I would use a combination of both. However, to keep things simple I am going to use only the characteristics of the movie (excluding the description), since the description can include certain biases depending on the year, company, etc.

Since a movie can belong to several genres, I’m going to dummify (one-hot encode) the genres. For this I will use the OneHotEncoder class from the Sklearn library.

If you want to learn more about the Sklearn library, I recommend that you read this post where I explain step by step how to use it.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline

# Filter data
keep_cols = ['Genre', 'Director', 'Star1', 'Star2']
imdb_filtered = imdb.loc[:, keep_cols]

# Create pipeline for numerical variables
numeric_pipe = Pipeline([
    ('scaler', StandardScaler())
])

# Create pipeline for categorical variables
categorical_pipe = Pipeline([
    ('encoder', OneHotEncoder(drop = 'first'))
])

# Create ColumnTransformer (the numeric list is empty for the columns kept above)
col_transf = ColumnTransformer([
    ('numeric', numeric_pipe, imdb_filtered.select_dtypes('number').columns.tolist()),
    ('categoric', categorical_pipe, imdb_filtered.select_dtypes('object').columns.tolist())
])

# Fit the transformer and encode the data
col_transf_fit = col_transf.fit(imdb_filtered)
imdb_filtered_transf = col_transf_fit.transform(imdb_filtered)
imdb_filtered_transf
<1000x2247 sparse matrix of type '<class 'numpy.float64'>'
    with 3986 stored elements in Compressed Sparse Row format>
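
One detail worth keeping in mind: OneHotEncoder treats each complete Genre string as a single category, so two films only match on genre if their strings are identical. If you wanted one column per individual genre instead, a possible sketch is shown below. It assumes the Genre column stores comma-separated genres; the three rows and the helper name multi_hot are made up for illustration.

```python
# Genre strings in a comma-separated style (assumed format of the Genre column);
# the three rows below are made up for illustration.
genres = ["Action, Crime, Drama", "Drama", "Action, Sci-Fi"]

def multi_hot(genre_strings):
    """Turn comma-separated genre strings into 0/1 rows, one column per genre."""
    split = [[g.strip() for g in s.split(",")] for s in genre_strings]
    vocabulary = sorted({g for row in split for g in row})
    rows = [[1 if g in row else 0 for g in vocabulary] for row in split]
    return vocabulary, rows

vocabulary, rows = multi_hot(genres)
print(vocabulary)  # ['Action', 'Crime', 'Drama', 'Sci-Fi']
print(rows)        # [[1, 1, 1, 0], [0, 0, 1, 0], [1, 0, 0, 1]]
```

With this encoding, the first and third films share the Action column even though their full genre strings differ.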

With this we have the data ready, now let’s see how to create our content-based recommendation system with Python.

3. Programming the content-based recommendation system with Python

Generating a content-based recommendation system is relatively easy. The system is based on two steps:

  1. Find the similarity of a movie that the user liked with respect to the rest of the movies.
  2. Select the N movies that most resemble the movie that the user liked.

If you look closely, this whole process looks a lot like the kNN algorithm. However, we must take into account that we have a sparse matrix, so not every type of distance is suitable; we will use one that works well with sparse data, such as the cosine distance.
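
To see what the two steps look like without any library, here is a minimal sketch that stores each film as a sparse row (only the non-zero columns) and uses the cosine distance, that is, 1 minus the cosine similarity. The rows and the helper names cosine_distance and nearest are made up for illustration.

```python
import heapq
import math

# Sparse one-hot rows stored as {column_index: value}; made-up data.
rows = {
    "film_0": {0: 1.0, 3: 1.0, 7: 1.0},
    "film_1": {0: 1.0, 3: 1.0, 8: 1.0},
    "film_2": {1: 1.0, 4: 1.0, 9: 1.0},
    "film_3": {0: 1.0, 5: 1.0, 9: 1.0},
}

def cosine_distance(a, b):
    """1 - cosine similarity over the non-zero entries (rows must be non-empty)."""
    dot = sum(value * b.get(col, 0.0) for col, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

def nearest(liked, rows, n=2):
    """Step 1: distance to every other film. Step 2: keep the n closest."""
    distances = {
        name: cosine_distance(rows[liked], vec)
        for name, vec in rows.items() if name != liked
    }
    return heapq.nsmallest(n, distances, key=distances.get)

closest = nearest("film_0", rows)
print(closest)  # ['film_1', 'film_3']
```

Only the columns that are actually set take part in the calculation, which is why this distance is convenient for sparse one-hot data.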

If you want to learn more about how the kNN algorithm works, the types of distance that exist and when to use each of them, I recommend that you read this post. It is programmed in R instead of Python, but at a theoretical level it will surely be useful to you.

Thus, to set up the content-based recommendation system, we are going to compute the similarities between the different films. For this we are going to use the NearestNeighbors class, setting cosine as the distance metric:

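from sklearn.neighbors import NearestNeighbors

# Number of neighbors to retrieve; 6 is an illustrative choice (the liked film plus 5 recommendations)
n_neighbors = 6
nneighbors = NearestNeighbors(n_neighbors = n_neighbors, metric = 'cosine').fit(imdb_filtered_transf)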

We now have our recommender trained. All that is left is to give it a movie and ask for recommendations. So let’s see what the system recommends to a person who has watched The Godfather:

# Distances and row indices of the nearest neighbors of the film in row 1
dist, ind = nneighbors.kneighbors(imdb_filtered_transf[1])

print("Liked Film")
print(imdb.loc[ind[0][0], :])
print("Recommended Films")
print(imdb.loc[ind[0][1:], :])
Liked Film
Recommended Films

As we can see, our content-based recommendation system has worked correctly, recommending movies similar to those that the user has liked.

Now that we know how to program a content-based recommendation system in Python, let’s see how to program a user-based recommendation system.

User-based recommendation system in Python

1. Data Loading

In order to train a user-based recommendation system we need two things:

  1. Information about the content that exists.
  2. Whether or not each user liked the content they have seen. This is usually measured by user ratings, “Likes”, etc.

To do this, I am going to download the data from this page which includes information about books, users and the ratings that users have given to said books:

import pandas as pd
import requests
import shutil

# URL of the zipped dataset (left empty in the source)
url = ""

# Make request
resp = requests.get(url)

# Get filename
filename = url.split('/')[-1]

# Download zipfile
with open(filename, 'wb') as f:
    f.write(resp.content)

# Extract the archive
# Assumption: the working directory is /content (as in Google Colab)
shutil.unpack_archive(filename, '/content')

ratings = pd.read_csv('BX-Book-Ratings.csv', sep = ";", encoding='latin-1', on_bad_lines='skip')


Now that we have the data, and before we start transforming it, let’s try to understand it. To do this, we are going to make a histogram of the ratings:


As we can see, we have many ratings with a value of 0. Although this may seem like an error, the source of the data (link) tells us that ratings of 0 are implicit ratings, that is, ratings that the user did not give directly but that were inferred from the user’s behavior.
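
A quick way to check how many ratings are implicit is to count how often each rating value appears. The sketch below uses a short made-up list instead of the real ratings column, and prints a simple text histogram.

```python
from collections import Counter

# Made-up ratings; in the real data 0 marks an implicit rating.
book_ratings = [0, 0, 0, 8, 10, 0, 7, 0, 9, 8]

counts = Counter(book_ratings)
for value in sorted(counts):
    # One '#' per rating with this value
    print(f"{value:>2} | {'#' * counts[value]}")

implicit_share = counts[0] / len(book_ratings)
print(f"Implicit share: {implicit_share:.0%}")  # Implicit share: 50%
```

With the real data, the same count could be taken over the Book-Rating column of the ratings dataframe.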

So now that we have the data and understand how it works, let’s prepare the data so that we can train our recommendation system.

2. Data preparation

As we have seen before, there are two types of user-based recommendation systems: memory-based systems and model-based systems.

So, in order to decide what type of system we are going to use, the first thing is to create a user-content matrix.

To do this, first of all we are going to remove the implicit ratings that come in the data, since we do not know exactly how they were obtained and they can add noise to the model.

In addition, we are going to create a sparse matrix, since if we were to pivot, we would most likely have memory problems, since it is a very large dataset:

from scipy import sparse

# Keep only explicit ratings (implicit ones are stored as 0)
res = (
    ratings
    .query('`Book-Rating` != 0')
)
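
As a sketch of how a sparse user-content matrix can be laid out, the code below maps each user and each book to an integer position and keeps only the ratings that exist, in coordinate (row, column, value) form. The three ratings are illustrative and the helper name to_coo is hypothetical.

```python
# Illustrative explicit ratings as (user, book, rating) tuples.
explicit = [
    (276726, "0155061224", 5),
    (276729, "052165615X", 3),
    (276729, "0521795028", 6),
]

def to_coo(triples):
    """Map users and books to integer positions and return COO-style lists."""
    user_pos, book_pos = {}, {}
    rows, cols, values = [], [], []
    for user, book, rating in triples:
        # setdefault assigns the next free position the first time an id is seen
        rows.append(user_pos.setdefault(user, len(user_pos)))
        cols.append(book_pos.setdefault(book, len(book_pos)))
        values.append(rating)
    return rows, cols, values, (len(user_pos), len(book_pos))

rows, cols, values, shape = to_coo(explicit)
print(rows, cols, values, shape)  # [0, 1, 1] [0, 1, 2] [5, 3, 6] (2, 3)
```

These lists could then be passed to scipy, for example as sparse.csr_matrix((values, (rows, cols)), shape=shape), so that the empty cells of the matrix are never stored.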

Now that we have the data ready, let’s move on to creating our user-based recommendation system with Python.

3. Programming the user-based recommendation system with Python

There are different ways in which we could create our recommendation system. In our case we are going to opt for latent semantic analysis (LSA). The idea is simple: decompose the user-content matrix into several smaller matrices whose product approximates the original matrix.
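
To give an intuition of what such a decomposition does, here is a minimal sketch that learns one small vector per user and per item with stochastic gradient descent, so that their dot product approximates each known rating. It is a simplified illustration with made-up data and hypothetical names (factorize, estimate), not the exact algorithm that any library implements.

```python
import random

def factorize(ratings, n_factors=2, n_epochs=1000, lr=0.02, reg=0.02, seed=0):
    """Learn user and item vectors whose dot product approximates each rating."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    # Start from small random vectors
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - sum(a * b for a, b in zip(p[u], q[i]))
            for f in range(n_factors):
                pu, qi = p[u][f], q[i][f]
                # Gradient step on the squared error with L2 regularization
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)
    return p, q

def estimate(p, q, user, item):
    """Predicted rating: dot product of the user and item vectors."""
    return sum(a * b for a, b in zip(p[user], q[item]))

# Made-up ratings as (user, item, rating).
ratings = [("ana", "b1", 5), ("ana", "b2", 1), ("ben", "b1", 4),
           ("ben", "b2", 1), ("cat", "b2", 5), ("cat", "b3", 4)]

p, q = factorize(ratings)
mae = sum(abs(r - estimate(p, q, u, i)) for u, i, r in ratings) / len(ratings)
print(round(mae, 3))
```

Once the vectors are learned, estimate can also be called for a user and an item that were never rated together, which is what turns the decomposition into a recommender.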

In addition, we will set aside some real data to check how well our recommendation system is working since, if it works correctly, the predicted values should be close to the real values in the data.
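
Setting data aside can be as simple as shuffling the ratings and holding out a share of them. The sketch below uses made-up rows and a hypothetical helper name; Surprise also offers its own train_test_split in surprise.model_selection, which is not used here.

```python
import random

def train_test_split(rows, test_share=0.25, seed=42):
    """Shuffle a copy of the rows and hold out a share of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_share)
    return shuffled[n_test:], shuffled[:n_test]

# Made-up (user, book, rating) rows.
rows = [("u1", "b1", 5), ("u1", "b2", 3), ("u2", "b1", 4), ("u2", "b3", 2),
        ("u3", "b2", 1), ("u3", "b3", 5), ("u4", "b1", 3), ("u4", "b2", 4)]

train, test = train_test_split(rows)
print(len(train), len(test))  # 6 2
```

The model is then trained only on train, and its predictions are compared with the ratings in test.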

Although said like this it sounds very complex, applying it is very simple, since the Surprise library already has specific functions for this purpose.

Note: the Surprise library focuses on explicit feedback, that is, ratings given explicitly by the user.

The first step is to install Surprise, which you can do with the following command:

pip install scikit-surprise

Once installed, we are going to convert our dataframe into a Surprise Dataset. To do this, we are going to use the load_from_df method, as well as a Reader that indicates the scale of the ratings and the order of the columns.

Important: Surprise datasets can only have 3 columns, in the order user, item, rating.

from surprise import Dataset, Reader

# Convert dataframes to surprise datasets
reader = Reader(line_format='user item rating', rating_scale=(1, 5))
data = Dataset.load_from_df(res, reader)

Now that we have loaded the data, we are going to generate our training set and train our recommendation system.

The goal of this post is not to train the best possible recommendation system, but to explain, in broad strokes, how to create a collaborative recommendation system in Python.

For this, the Surprise library has several types of models, among which is SVD, which corresponds to the LSA approach discussed earlier.

So, we train our model:

from surprise import SVD

# Build full trainset
data_train_surp = data.build_full_trainset()

# Define the model
svd = SVD()

# Train the model
svd.fit(data_train_surp)
<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7effdb9f2690>

Now that we have the model trained, let’s see how it performs on new data. To do this, I’m going to create the test set and make the predictions on that set.

Finally, we are going to use the functions offered by surprise to evaluate how well our model works:

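from surprise import accuracy

# Note: this test set is built from the training set itself
data_test_surp = data_train_surp.build_testset()

# Predict the rating of every (user, item) pair in the test set
predictions = svd.test(data_test_surp)

# Compute the mean absolute error
accuracy.mae(predictions)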


As we can see, we have an MAE of 2.75, which, considering that we are talking about 1 to 5 stars, is quite high.
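
For reference, the MAE (mean absolute error) is simply the average absolute difference between the real ratings and the predicted ones. A minimal sketch with made-up values:

```python
def mean_absolute_error(actual, predicted):
    """Average absolute gap between true and predicted ratings."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Made-up true ratings and predictions.
actual = [5, 3, 4, 1]
predicted = [4.5, 3.5, 2.0, 2.0]

mae = mean_absolute_error(actual, predicted)
print(mae)  # 1.0
```

An MAE of 1.0 means that, on average, the predictions are off by one rating point.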

In order to improve it, different techniques could (and should) be applied, such as hyperparameter optimization using Grid Search.

In fact, Surprise has the GridSearchCV class, which fulfills this purpose, although that is not the objective of this post.

Anyway, let’s see how to get a prediction for a specific user and a specific book. To do this, we must pass the user_id and content_id values to our recommendation system:

user_id = 276726
content_id = '0155061224'

svd.predict(user_id, content_id)
Prediction(uid=276726, iid='0155061224', r_ui=None, est=5, details={'was_impossible': False})
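
A single prediction is rarely the end goal: usually we want the best N items for a user. One possible sketch is to score every item the user has not rated and keep the highest estimates. The function top_n, the stand-in fake_predict and its scores are all hypothetical; with a trained Surprise model, the predict argument could wrap svd.predict(user_id, item_id).est.

```python
import heapq

def top_n(predict, user_id, candidate_items, seen_items, n=3):
    """Rank items the user has not rated by their predicted rating."""
    unseen = [item for item in candidate_items if item not in seen_items]
    return heapq.nlargest(n, unseen, key=lambda item: predict(user_id, item))

# Stand-in for a trained model: a hypothetical lookup of predicted ratings.
fake_estimates = {"book_a": 4.2, "book_b": 3.1, "book_c": 4.8, "book_d": 2.0}

def fake_predict(user_id, item_id):
    """Return a made-up estimate regardless of the user."""
    return fake_estimates[item_id]

best = top_n(fake_predict, 276726, fake_estimates, seen_items={"book_c"}, n=2)
print(best)  # ['book_a', 'book_b']
```

Items the user has already rated are excluded first, so the highest scoring book overall (book_c) does not appear in the result.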


In this post I have explored the different types of recommender systems that exist and we have seen how to create the basic recommender systems with Python.

Without a doubt, this post is only intended to be a small introduction to a much broader and more complex world, since in real life recommendation systems must solve problems such as implicit feedback, deployment to production, or A/B testing to measure the predictive capacity of the algorithms.

In any case, I hope that this post has helped you to learn a little more about the world of recommendation systems, the types that exist, the pros and cons of each one and how they are developed.

If you liked this post, I recommend that you subscribe to be notified every time I publish a new post. Until next time!