Text Classification with Naive Bayes in Python

In this post you will learn what Naive Bayes is, how it works and how to use it in Python. Although Naive Bayes is one of the simplest models in Machine Learning, it is also widely used in natural language processing projects. For this reason, it is a model that everyone working in Machine Learning should master.

So, in this post you will learn:

  1. What Naive Bayes is and how it works.
  2. How to apply Naive Bayes in Python, including a practical example of text classification.
  3. The advantages and disadvantages of using Naive Bayes.

Sound interesting to you? Let’s go with it!

How Naive Bayes works

Naive Bayes is a Bayesian model, that is, probabilistic. This means that predictions are based on calculating probabilities.

In addition, as its name indicates, it is a naive model. This is because Naive Bayes assumes that, given the class, the effect of each variable is independent of all the other variables. In this way, the calculation of the probabilities is much simpler.

In short, whether the word “good” appears in a message is treated as independent of whether the word “morning” appears. As you might expect, in reality this is not the case. However, this assumption makes the calculations much easier, as we will see later.

In mathematical terms, Naive Bayes is built on the following formula, which is known as Bayes’ rule:

\(P(A|B) =\frac {P(B|A)P(A)}{P(B)}\)

  • P(A|B): probability that A occurs knowing that B has already occurred.
  • P(B|A): probability that B occurs knowing that A has occurred.
  • P(A): probability that A occurs.
  • P(B): probability that B occurs.
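As a quick illustration, Bayes’ rule can be written as a one-line function. The numbers below are hypothetical (they do not come from any dataset) and the function name `bayes_rule` is just an illustrative choice:

```python
def bayes_rule(p_b_given_a, p_a, p_b):
    """Return P(A|B) computed from P(B|A), P(A) and P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical numbers: 30% of messages are spam (A), the word "money"
# appears in 40% of spam messages (B given A) and in 15% of all messages (B).
p_spam_given_money = bayes_rule(p_b_given_a=0.40, p_a=0.30, p_b=0.15)
print(round(p_spam_given_money, 2))
```

With these made-up values, seeing the word raises the probability of spam from 30% to roughly 80%.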

Taking this into account, depending on the type of distribution that we use to calculate the probabilities, we will have one implementation of Naive Bayes or another.

In general, there are two main implementations of Naive Bayes: Gaussian Naive Bayes and Multinomial Naive Bayes.

Gaussian Naive Bayes

As its name suggests, Gaussian Naive Bayes assumes that the data follows a Gaussian distribution. In this case, the term P(B|A) of the previous formula, that is, the probability of observing a value x_i given the class y, is computed with the following function:

\(P(x_i|y)= \frac{1}{\sqrt{2\pi\sigma^{2}_{y}}}e^{-\frac{(x_i-\mu_{y})^2}{2\sigma^{2}_{y}}}\)

Basically, what Gaussian Naive Bayes does is assume a normal distribution for each variable and class, defined by the mean and standard deviation of the training data for that class. In this way, if we are going to make a binary classification, for each variable we will have two distributions (one per class), as shown in the following image:

Thus, when we receive an observation, we will calculate the probability that this value comes from each of the distributions, as shown in the following image:

We will repeat this process for each of the variables and obtain the final probability that this observation belongs to each group. Finally, the class that obtains the highest probability will be the prediction of the algorithm.

When obtaining the probabilities, logarithms are usually taken, so that the log-probabilities are added instead of the probabilities being multiplied. In this way, numerical problems are avoided in the calculation of the total probability when the probability in one of the variables is very low.
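The procedure described above can be sketched in a few lines of plain Python. This is only a minimal illustration, not Sklearn’s implementation: the per-class means, standard deviations and priors in `params` are hypothetical values, and the helper names are made up for the example.

```python
import math

def gaussian_log_pdf(x, mean, std):
    """Log of the normal density, i.e. log P(x_i | y) for one variable."""
    var = std ** 2
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

# Hypothetical statistics: for each class, a prior and one (mean, std)
# pair per variable, as if they had been estimated from training data.
params = {
    "class_0": {"prior": 0.5, "stats": [(1.0, 0.5), (10.0, 2.0)]},
    "class_1": {"prior": 0.5, "stats": [(3.0, 0.5), (14.0, 2.0)]},
}

def predict(observation):
    """Return the most likely class and the log-score of every class."""
    scores = {}
    for label, p in params.items():
        # Sum of logs instead of a product of probabilities
        score = math.log(p["prior"])
        for x, (mean, std) in zip(observation, p["stats"]):
            score += gaussian_log_pdf(x, mean, std)
        scores[label] = score
    return max(scores, key=scores.get), scores

label, scores = predict([2.8, 13.0])
print(label, scores)
```

The observation is close to the means of `class_1` for both variables, so that class gets the highest log-score.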

Now that you understand how Gaussian Naive Bayes works, let’s see how Multinomial Naive Bayes works.

Multinomial Naive Bayes

Multinomial Naive Bayes, as you might guess, is based on assuming a Multinomial distribution. The Multinomial distribution is a generalization of the Binomial distribution to more than two possible outcomes: each trial produces one of several outcomes, each with its own probability, and these probabilities always add up to one.

Thus, in the case of Multinomial Naive Bayes we must:

  1. Calculate the probability of each of the classes.
  2. Calculate, for each class, the probability of each value (for example, each word) appearing within that class.
  3. For each possible class, compute the final probability that, given the input data, the observation belongs to that class.

Put like that, it’s probably hard to understand. So, since this post is about understanding Naive Bayes and, above all, knowing how to do text classification in Python with Naive Bayes, let’s see a theoretical example.

Example of how Multinomial Naive Bayes works

Let’s say we want to classify text messages as spam or not spam. For this we have 20 different messages.

So, the first thing is to create a table that tells us how many times each word has appeared in cases where the message was spam and how many times they have appeared in non-spam messages. Suppose the table is the following:

| Type of Document  | Dear     | Friend| Food   | Money  |
| No Spam           | 8        | 5     | 3      | 1      |
| Spam              | 2        | 1     | 0      | 4      |

From this table we can calculate how likely it is that the word “Dear” will appear within a “No Spam” message. This is simply the proportion of times the word “Dear” has come up in “No Spam” messages:

\(P(Dear|No Spam) = \frac {8}{8+5+3+1} = 0.47\)

If we did this same process for each of the words and each of the classes, we would end up with the following table:

| Type of Document  | Dear     | Friend| Food   | Money  |
| No Spam           | 0.47     | 0.29  | 0.18   | 0.06   |
| Spam              | 0.29     | 0.14  | 0      | 0.57   |

On the other hand, we also need to know the probability of each class. In this example we estimate it as the proportion of Spam and non-Spam words (in practice, the proportion of Spam and non-Spam messages is normally used).

\(Prob(NoSpam) = \frac{8 + 5 + 3 + 1}{8 + 5 + 3 + 1 + 2 + 1 + 0 + 4} = \frac{17}{17+7} = 0.71\)

\(Prob(Spam) = \frac{2 + 1 + 0 + 4}{8 + 5 + 3 + 1 + 2 + 1 + 0 + 4} = \frac{7}{17+7} = 0.29\)

With this information, suppose we receive a message with the words “Dear Friend”. We can now apply the previous formula to classify this message. Let’s see it:

\(P(No Spam) \times P(Dear | No Spam) \times P(Friend | No Spam) = 0.71 \times 0.47 \times 0.29 = 0.10\)

\(P(Spam) \times P(Dear | Spam) \times P(Friend| Spam) = 0.29 \times 0.29 \times 0.14 = 0.01\)

As we can see, that message is more likely to be No Spam than Spam. Therefore, the prediction made by Naive Bayes is that this message is not spam.
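The same calculation can be reproduced in Python directly from the frequency table. This is a minimal sketch that mirrors the arithmetic of this example (including the class probability computed from word counts); `score_message` is a hypothetical helper name.

```python
# Word counts per class, copied from the frequency table above
counts = {
    "No Spam": {"Dear": 8, "Friend": 5, "Food": 3, "Money": 1},
    "Spam":    {"Dear": 2, "Friend": 1, "Food": 0, "Money": 4},
}

def score_message(words, counts):
    """Multiply the class probability by P(word | class) for every word."""
    total_words = sum(sum(c.values()) for c in counts.values())
    scores = {}
    for label, word_counts in counts.items():
        class_total = sum(word_counts.values())
        score = class_total / total_words          # P(class)
        for word in words:
            score *= word_counts[word] / class_total  # P(word | class)
        scores[label] = score
    return scores

scores = score_message(["Dear", "Friend"], counts)
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

The scores come out at roughly 0.10 for “No Spam” and 0.01 for “Spam”, so the predicted class is “No Spam”.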

Problems with probabilities of zero

Hopefully the way Naive Bayes works is intuitive and makes sense to you. However, what would happen if the message that reaches us has the words “Money”, “Food” and “Money”? Let’s see it:

\(P(No Spam) \times P(Money | No Spam) \times P(Food | No Spam) \times P(Money | No Spam) = 0.71 \times 0.06 \times 0.18 \times 0.06 \approx 0.0004\)

\(P(Spam) \times P(Money | Spam) \times P(Food | Spam) \times P(Money | Spam) = 0.29 \times 0.57 \times 0.0 \times 0.57 = 0\)

As we can see, even though intuitively the message looks like Spam, since the word “Money” appears a lot in Spam messages, it will be classified as “No Spam”. This is because the word “Food” has never appeared in a Spam message, so its probability is zero, which makes the probability of the whole message zero.

To solve this problem, Laplace smoothing is applied. Laplace smoothing basically consists of adding 1 to every count in the frequency table. In this way, no probability is zero and we avoid the previous problem. Let’s see it:

Table of Frequencies without Applying Laplace

| Type of Document  | Dear     | Friend| Food   | Money  |
| No Spam           | 8        | 5     | 3      | 1      |
| Spam              | 2        | 1     | 0      | 4      |

Frequency Table Applying Laplace

| Type of Document  | Dear     | Friend| Food   | Money  |
| No Spam           | 9        | 6     | 4      | 2      |
| Spam              | 3        | 2     | 1      | 5      |

As you can see, now all the words have appeared at least once in each class, so if we calculate the probabilities there is no word with a probability of zero:

| Type of Document  | Dear     | Friend| Food   | Money  |
| No Spam           | 0.43     | 0.29  | 0.19   | 0.10   |
| Spam              | 0.27     | 0.18  | 0.09   | 0.45   |

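Now, if we classify the message with the words “Money”, “Food” and “Money” again, also recalculating the class probabilities from the smoothed table (21/32 = 0.66 for No Spam and 11/32 = 0.34 for Spam), we will obtain the following prediction:

\(P(No Spam) \times P(Money | No Spam) \times P(Food | No Spam) \times P(Money | No Spam) = 0.66 \times 0.10 \times 0.19 \times 0.10 \approx 0.0012\)

\(P(Spam) \times P(Money | Spam) \times P(Food | Spam) \times P(Money | Spam) = 0.34 \times 0.45 \times 0.09 \times 0.45 \approx 0.0062\)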

As you can see, thanks to applying Laplace the message has gone from being classified (incorrectly) as “No Spam” to being classified (correctly) as “Spam”.
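The effect of Laplace can be checked with a short sketch. It follows the arithmetic of this example (the class probabilities are also recomputed from the smoothed counts), and the function name `naive_bayes_scores` and the `alpha` parameter are illustrative choices, not part of any library:

```python
# Word counts per class, copied from the frequency table without Laplace
counts = {
    "No Spam": {"Dear": 8, "Friend": 5, "Food": 3, "Money": 1},
    "Spam":    {"Dear": 2, "Friend": 1, "Food": 0, "Money": 4},
}

def naive_bayes_scores(words, counts, alpha=0):
    """Score each class for a message, adding `alpha` to every word count."""
    smoothed = {
        label: {word: n + alpha for word, n in word_counts.items()}
        for label, word_counts in counts.items()
    }
    grand_total = sum(sum(c.values()) for c in smoothed.values())
    scores = {}
    for label, word_counts in smoothed.items():
        class_total = sum(word_counts.values())
        score = class_total / grand_total          # P(class)
        for word in words:
            score *= word_counts[word] / class_total  # P(word | class)
        scores[label] = score
    return scores

message = ["Money", "Food", "Money"]
without_laplace = naive_bayes_scores(message, counts, alpha=0)
with_laplace = naive_bayes_scores(message, counts, alpha=1)
print(without_laplace)
print(with_laplace)
```

Because the sketch works with unrounded fractions, the scores differ slightly from the rounded figures in the text, but the conclusion is the same: without Laplace the Spam score is exactly zero, and with Laplace Spam obtains the highest score.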

That is it for the theoretical introduction to Naive Bayes. As you can see, it is a very simple model but it usually works very well. Now, how can you apply Naive Bayes in Python? Let’s see it!

Text Classification in Python with Naive Bayes

The easiest way to use Naive Bayes in Python is, of course, using Scikit Learn, the main library for using Machine Learning models in Python.

If you don’t know Scikit Learn in depth, I recommend reading this post.

To use Naive Bayes in Python, we can find it inside Sklearn’s naive_bayes module. More specifically, this module has five different Naive Bayes models: Gaussian Naive Bayes, Multinomial Naive Bayes, Complement Naive Bayes, Bernoulli Naive Bayes and Categorical Naive Bayes.

Although there are several models, in my opinion the most used are Gaussian Naive Bayes, which is the traditional Naive Bayes model, and Multinomial Naive Bayes, which is the Naive Bayes model that is usually applied in text classification projects.

Knowing this, let’s see how to use the Naive Bayes model in Python to classify text.

Case Study: SMS Spam Detection

As one of the main strengths of Naive Bayes is precisely its ability to handle many variables, we are going to use a case of that kind: the detection of spam in SMS messages.

To do this, we will use the SMS Spam Collection dataset, which can be found for free on Kaggle. In my case, I will read the file from the UCI Machine Learning Repository.

So, first of all we will have to read the data:

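# pip install requests pandas
import requests
import zipfile
import pandas as pd

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip'
data_file = 'SMSSpamCollection'

# Make request
resp = requests.get(url)
resp.raise_for_status()

# Get filename
filename = url.split('/')[-1]

# Download zipfile
with open(filename, 'wb') as f:
    f.write(resp.content)

# Extract Zip
with zipfile.ZipFile(filename, 'r') as zip_file:
    zip_file.extractall()

# Read Dataset
# Note: the file has no header row, so header = 0 makes pandas discard the
# first message. It is kept here so that the outputs shown in this post still
# match; use header = None to keep every message.
data = pd.read_table(data_file,
                     header = 0,
                     names = ['type', 'message'])

# Show dataset
data.head()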
	type	message
0	ham	Ok lar... Joking wif u oni...
1	spam	Free entry in 2 a wkly comp to win FA Cup fina...
2	ham	U dun say so early hor... U c already then say...
3	ham	Nah I don't think he goes to usf, he lives aro...
4	spam	FreeMsg Hey there darling it's been 3 week's n...

Now that we have the data, let’s take a look at how to use Naive Bayes in Python. Let’s go with it!

Data Transformation

Once we have the data, first of all we have to process it. While data processing is an essential step in any machine learning project, in NLP projects (like this one) processing the data is even more important.

So, we will apply the following steps:

  1. Tokenization: it consists of separating the messages into words in order to be able to treat each one of the words. We can do this thanks to NLTK’s word_tokenize function.
  2. Elimination of stop words: removal of words that do not add value (prepositions, conjunctions, etc.). This is done because keeping them would greatly increase the size of our dataset (already large) without providing any value, only noise. There are two main sources of stop words in Python: those from Sklearn and those from the NLTK library.
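# Sklearn
from sklearn.feature_extraction import text
sklearn_stop_words = text.ENGLISH_STOP_WORDS

# NLTK (requires nltk.download('stopwords') the first time)
import nltk
from nltk.corpus import stopwords
nltk_stop_words = stopwords.words('english')

In this example we use the default stop words. However, it is usually worth reviewing the list and adding words to it, or even removing certain words.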

  3. Stemming or lemmatization: the objective of this step is that words that mean the same thing, but are not spelled the same, end up being spelled the same. After all, for the model the word “play” and the word “playing” are different. To do this there are two main techniques:

  • Stemming: stemming consists of removing the endings of the words to keep only the root. Following the previous case, in both cases (play, playing) we would be left with “play”.
  • Lemmatization: it is widely used in NLP projects in English, since in this language good, better and best are completely different words. In these cases stemming would not work. Lemmatization converts all those words to their base form (good), so that they are treated as the same word.
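To get a feeling for what stemming does (and where it falls short), here is a small sketch with NLTK’s PorterStemmer, which needs no extra downloads. The word list is just an illustrative choice:

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Regular inflections collapse to the same root, but irregular forms
# such as "good" / "better" / "best" are left as different words.
words = ["play", "plays", "played", "playing", "good", "better", "best"]
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(word, "->", stem)
```

The four forms of “play” end up with the same stem, while “good”, “better” and “best” stay apart, which is the kind of case lemmatization is meant to handle.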

To carry out all these processes we will use the nltk package, that is, the Natural Language Toolkit, which includes many NLP functionalities for Python and will be very useful for using Naive Bayes in Python.

import nltk

# Install everything necessary
nltk.download('punkt')
nltk.download('stopwords')
# Note: recent NLTK releases may also need nltk.download('punkt_tab')

from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))

# Tokenize
data['tokens'] = data['message'].apply(nltk.word_tokenize)

# Remove stop words
data['tokens'] = data['tokens'].apply(lambda x: [item for item in x if item not in stop])

# Apply Porter stemming
stemmer = PorterStemmer()
data['tokens'] = data['tokens'].apply(lambda x: [stemmer.stem(item) for item in x])
[nltk_data] Downloading package punkt to
[nltk_data] C:\Users\Ander\AppData\Roaming\nltk_data...
[nltk_data] Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data] C:\Users\Ander\AppData\Roaming\nltk_data...
[nltk_data] Unzipping corpora\stopwords.zip.

Perfect, we have already cleaned our data. However, this does not end here: currently we only have a column that contains a list of words for each message.

Naive Bayes accepts two kinds of input:

  1. A TF (term frequency) matrix, that is, a matrix that shows, for each document, how many times each word of the vocabulary appears in it.
  2. An occurrence matrix. It is similar to a TF matrix, but in this case, instead of indicating the number of occurrences, it simply indicates whether or not that word appeared.

The use of each of them will depend a lot on the context. In the case of SMS messages, since they are very short, it is unlikely that words will be repeated, so both approaches will probably return very similar results.

However, in longer texts it is probably more interesting to apply a TF matrix than an occurrence matrix.

That said, in order to get to a TF matrix we are going to do the following:

  1. Detokenize the values, so that the “tokens” column does not contain lists, but text. This is necessary for the third step to work properly.
  2. Split the data into train and test. It is very important to make the split before building the TF matrix. Otherwise, we will have data leakage problems that can affect our results. Working this way also lets us check that our data pipeline handles unseen data correctly.
  3. Apply the CountVectorizer class from Sklearn’s feature_extraction.text module to our train and test data. It allows us to create the TF matrix or, if we set the parameter binary = True, an occurrence matrix.
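The difference between the two kinds of matrix is easy to see on a tiny example. The two documents below are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents; "free" appears twice in the first one
docs = ["free money free prize", "call me later"]

# TF matrix: number of times each word appears in each document
tf_vectorizer = CountVectorizer()
tf_matrix = tf_vectorizer.fit_transform(docs)

# Occurrence matrix: 1 if the word appears in the document, 0 otherwise
binary_vectorizer = CountVectorizer(binary=True)
binary_matrix = binary_vectorizer.fit_transform(docs)

free_col = tf_vectorizer.vocabulary_["free"]
print(sorted(tf_vectorizer.vocabulary_))
print(tf_matrix[0, free_col], binary_matrix[0, free_col])
```

For the first document the TF matrix stores a 2 in the “free” column, while the occurrence matrix stores a 1.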

That said, let’s see how we can apply our TF matrix to be able to train our NLP model with Naive Bayes in Python:

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# Unify the strings once again
data['tokens'] = data['tokens'].apply(lambda x: ' '.join(x))

# Make split (features: cleaned text, target: message type)
x_train, x_test, y_train, y_test = train_test_split(
    data['tokens'],
    data['type'],
    test_size = 0.2
)

# Create vectorizer
vectorizer = CountVectorizer(
    strip_accents = 'ascii',
    lowercase = True
)

# Fit vectorizer on train only & transform both sets
vectorizer_fit = vectorizer.fit(x_train)
x_train_transformed = vectorizer_fit.transform(x_train)
x_test_transformed = vectorizer_fit.transform(x_test)

Naive Bayes Training

Perfect, we already have our data transformed. Finally, it only remains to train the model. As we have seen before, for text projects, the model that is best suited is Sklearn’s MultinomialNB class.

So we just do the fit and predict in both train and test. In addition, to visualize the predictive capacity of the model we are going to see both the balanced accuracy and the confusion matrix:

# Build the model
from sklearn.naive_bayes import MultinomialNB

# Train the model
naive_bayes = MultinomialNB()
naive_bayes_fit = naive_bayes.fit(x_train_transformed, y_train)

from sklearn.metrics import confusion_matrix, balanced_accuracy_score

# Make predictions
train_predict = naive_bayes_fit.predict(x_train_transformed)
test_predict = naive_bayes_fit.predict(x_test_transformed)

def get_scores(y_real, predict):
  ba_train = balanced_accuracy_score(y_real, predict)
  cm_train = confusion_matrix(y_real, predict)

  return ba_train, cm_train 

def print_scores(scores):
  return f"Balanced Accuracy: {scores[0]}\nConfusion Matrix:\n {scores[1]}"

train_scores = get_scores(y_train, train_predict)
test_scores = get_scores(y_test, test_predict)

print("## Train Scores")
print(print_scores(train_scores))
print("\n\n## Test Scores")
print(print_scores(test_scores))
## Train Scores
Balanced Accuracy: 0.9888480691883254
Confusion Matrix:
 [[3837   10]
 [  12  597]]

## Test Scores
Balanced Accuracy: 0.9404936733271031
Confusion Matrix:
 [[974   3]
 [ 16 122]]

Cool! As we can see, although it is a very basic model, it has a very good predictive capacity in text classification. So now you know how to train a Naive Bayes model in Python for NLP projects!
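If you are wondering where the balanced accuracy comes from, it is simply the average of the recall obtained in each class. As a quick check, this sketch recomputes it by hand from the test confusion matrix printed above (in Sklearn, rows are the real classes and columns the predicted ones):

```python
# Test confusion matrix reported above: rows are the real classes
# (ham, spam) and columns are the predicted classes.
cm = [[974, 3],
      [16, 122]]

# Recall of each class: correct predictions divided by the row total
recalls = [row[i] / sum(row) for i, row in enumerate(cm)]
balanced_accuracy = sum(recalls) / len(recalls)

# Plain accuracy, for comparison
accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(sum(row) for row in cm)

print(recalls)
print(balanced_accuracy, accuracy)
```

The plain accuracy is higher than the balanced accuracy because most messages are ham, and the model misses more spam messages (16 out of 138) than ham messages (3 out of 977).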

Next, let’s look at the pros and cons of this model so that you know when, in my opinion, it is a good idea to use it.

Pros of Naive Bayes

  • It is a very easily interpretable model, which is a very positive point, especially for NLP, where there are few alternative options that are also interpretable.
  • It is a very simple model to train and to tune, and even so it can give very good results, as we have seen in the previous example.
  • It works very well for datasets with many variables, which is not true of every machine learning model. Once again, this makes it very interesting for NLP.
  • It can be used for binary classification problems as well as multiclass classification.

Cons of Naive Bayes

  • The assumption that the variables are independent. In the vast majority of cases this assumption is not fulfilled.
  • Appearance of new words or classes. In the case of NLP projects, it is very normal that when making predictions new words appear that the model has not seen when training. As a result, the model will not take these words into account when making predictions, so it will have to be retrained frequently.
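The second point is easy to reproduce. In this sketch (with made-up messages), a CountVectorizer fitted on the training vocabulary simply ignores words it has never seen:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training messages
vectorizer = CountVectorizer()
vectorizer.fit(["free money now", "see you tomorrow"])

# A new message made only of words that were not in the training data
new_message = vectorizer.transform(["bitcoin airdrop"])

print("bitcoin" in vectorizer.vocabulary_)
print(new_message.sum())
```

Since none of the words is in the vocabulary, the resulting row contains only zeros, so the model would have to decide based on the class probabilities alone.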


Without a doubt, Naive Bayes is a very simple model to understand, interpret and apply. As a general rule, it is not usually a model that works very well in classification projects. However, when it comes to classifying text, Naive Bayes is, in my opinion, the first model to try.

In any case, when it comes to a text classification project, it is always interesting to try other types of models (such as Support Vector Classifiers) and pay close attention to the data cleaning and quality process.
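One convenient way to try other models is to wrap the vectorizer and the classifier in a Sklearn pipeline, so that swapping the model is a one-line change. The following is only a sketch on a hypothetical toy corpus (not the SMS dataset), and LinearSVC is just one possible Support Vector Classifier:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus, only to make the sketch runnable
messages = [
    "win free prize money now",
    "free entry win cash prize",
    "claim your free money",
    "are you coming home for dinner",
    "see you at lunch tomorrow",
    "call me when you get home",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Same preprocessing, different classifier
candidates = {
    "naive_bayes": make_pipeline(CountVectorizer(), MultinomialNB()),
    "linear_svc": make_pipeline(CountVectorizer(), LinearSVC()),
}

new_messages = ["free prize money", "see you at home"]
predictions = {}
for name, model in candidates.items():
    model.fit(messages, labels)
    predictions[name] = list(model.predict(new_messages))
    print(name, predictions[name])
```

On a real project each candidate would of course be compared on a held-out test set, as we did above, rather than on a handful of messages.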

I hope this post has helped you to better understand how this model works. If so, and you would like to be aware of the posts that I am uploading, I encourage you to subscribe so that I can notify you. See you in the next one!