How to program the kmeans algorithm in Python from scratch

The kMeans algorithm is one of the most widely used clustering algorithms in the world of machine learning. Using the kMeans algorithm in Python is very easy thanks to scikit-learn. However, do you know how the kMeans algorithm works under the hood, the problems it can run into, and the good practices we should follow when using it?

In this post, we are going to learn all that and much more. To do this, we are going to program the kMeans algorithm from scratch and solve all the problems we encounter along the way. Sound interesting? Well, let’s get to it!

How does the kMeans algorithm work?

The kMeans algorithm finds the k points (called centroids) that minimize the sum of squared errors, that is, the sum of the squared distances from each observation to its closest centroid. The process is repeated iteratively until the total error no longer decreases. At that point we have reached a minimum, and our observations are classified into different groups or clusters.
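Formally, if $ \mu_{1}, \dots, \mu_{k} $ are the centroids and $ C_{j} $ is the set of observations assigned to centroid $ \mu_{j} $, the quantity that kMeans minimizes is:

$$ SSE = \sum_{j=1}^{k} \sum_{x_{i} \in C_{j}} \lVert x_{i} - \mu_{j} \rVert^2 $$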

Thus, the kMeans algorithm consists of the following steps (previewed in code right after this list):

  1. Initialize k centroids randomly.
  2. Calculate the squared error from each observation to each centroid.
  3. Assign each observation to its closest centroid.
  4. Calculate the total error and compare it with the total from the previous iteration.
  5. If the error has decreased, recalculate the centroids and repeat the process.
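
Before we build each piece step by step, here is a compact preview of the whole loop in plain NumPy. Take it as a minimal sketch (the function name and details are illustrative, not the final implementation we will write below):

import numpy as np

def kmeans_sketch(X, k, max_iter = 100, seed = 0):
    rng = np.random.default_rng(seed)
    # 1. Initialize k centroids randomly within the range of the data
    mins, maxs = X.min(axis = 0), X.max(axis = 0)
    centroids = rng.uniform(mins, maxs, size = (k, X.shape[1]))
    prev_error = np.inf
    for _ in range(max_iter):
        # 2-3. Squared error of every observation to every centroid,
        #      then assign each observation to its closest centroid
        errors = ((X[:, None, :] - centroids[None, :, :])**2).sum(axis = 2)
        labels = errors.argmin(axis = 1)
        # 4. Total error; stop when it no longer decreases
        total_error = errors[np.arange(len(X)), labels].sum()
        if total_error >= prev_error:
            break
        prev_error = total_error
        # 5. Recalculate each centroid as the mean of its assigned points
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis = 0)
    return labels, centroids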

As you can see, at a conceptual level it is a very simple algorithm, although, as you will see, we will run into some trickier points along the way. In any case, to see how it works, we first need some data on which to use the kMeans algorithm in Python.

Creating fake data

So, I’m going to create some clearly separated fictitious data for three groups, as I did in the post on how to program a neural network from scratch in Python.

To do this, I am going to create a function that, given some coordinates, a radius, and a number of points, creates that many points randomly distributed around the chosen center.

In this way, we will create three differentiated groups in which data from different groups may still end up close together, giving rise to some classification errors.

import math
import numpy as np
import pandas as pd
np.random.seed(123)

def circulo(num_datos = 100, R = 1, minimo = 0, maximo = 1, center_x = 0, center_y = 0):
    pi = math.pi

    # Random radii (the square root keeps the points uniformly spread over the area)
    r = R * np.sqrt(np.random.uniform(minimo, maximo, size = num_datos))
    # Random angles between 0 and 2*pi
    theta = np.random.uniform(minimo, maximo, size = num_datos) * 2 * pi

    # Convert polar coordinates to cartesian coordinates
    x = center_x + np.cos(theta) * r
    y = center_y + np.sin(theta) * r

    x = np.round(x, 3)
    y = np.round(y, 3)

    df = pd.DataFrame(np.column_stack([x, y]), columns = ['x', 'y'])
    return df

# Create data
datos_1 = circulo(num_datos = 20,R = 10, center_x = 5, center_y = 30)
datos_2 = circulo(num_datos = 20,R = 10, center_x = 20, center_y = 10)
datos_3 = circulo(num_datos = 20,R = 10, center_x = 50, center_y = 50)

data = pd.concat([datos_1, datos_2, datos_3])
data.head()
       x       y
0 -0.542  23.761
1  8.129  25.661
2  4.239  25.298
3 -0.691  25.230
4  3.539  21.645

If we visualize our data we will see how there are three differentiated groups:

import matplotlib.pyplot as plt
%matplotlib inline

plt.scatter(datos_1['x'], datos_1['y'], c = 'b')
plt.scatter(datos_2['x'], datos_2['y'], c = 'r')
plt.scatter(datos_3['x'], datos_3['y'], c = 'g')
plt.show()
Data to be classified by kmeans algorithm

As we can see, we have three well-differentiated groups of data. Our goal will be to create a kMeans algorithm in Python that is capable of solving this problem and grouping the data correctly.

Following the explanation above, the first step in creating our kMeans algorithm in Python will be to initialize the centroids. So, let’s go for it!

Program kmeans algorithm in Python from scratch

Random initialization of the centroids

First of all, we must initialize k centroids randomly. This is not much of a mystery. However, to speed up the assignment process, it helps if the centroids fall within the range of the data itself. This helps the algorithm converge sooner, similar to what happens with parameter initialization in other algorithms, such as neural networks.

So, we create a function that, given the data and the number of centroids, initializes them:

def initialize_centroids(k, data):
    '''
    Randomly initializes k centroids within the overall range of the data.
    '''
    n_dims = data.shape[1]
    centroid_min = data.min().min()
    centroid_max = data.max().max()
    centroids = []

    for _ in range(k):
        centroid = np.random.uniform(centroid_min, centroid_max, n_dims)
        centroids.append(centroid)

    centroids = pd.DataFrame(centroids, columns = data.columns)

    return centroids

centroids = initialize_centroids(3, data)
centroids
           x          y
0  52.833388   9.953092
1  15.168107  29.151304
2  52.622571  57.644971

Now that we have our centroids, we must learn to calculate the error.

Calculate the sum of squared errors

Surely you have heard that the kMeans algorithm is based on measuring distances. Strictly speaking, this is not the case: the kMeans algorithm seeks to minimize the sum of squared errors. For a single observation in two dimensions, the squared error with respect to a centroid is calculated as follows:

$$ error = (x_{2} - x_{1})^2 + (y_{2} - y_{1})^2 $$

Interestingly, the square root of this quantity is exactly the Euclidean distance:

$$ d = \sqrt{(x_{2} - x_{1})^2 + (y_{2} - y_{1})^2} $$

That is why the two often get confused, although now you know that kMeans is not based on distance as such. In fact, if the kMeans algorithm were based on distances, we could plug in other distance measures (as in the kNN algorithm), but this is not the case: the centroid update (taking the mean) specifically minimizes the squared Euclidean error.

In any case, we can create a function that calculates the squared error between two points:

def calculate_error(a, b):
    '''
    Given two NumPy arrays, calculates the squared error
    (the sum of squared differences) between them.
    '''
    error = np.sum((a - b)**2)

    return error

Since we already have the centroids, the data, and a way to calculate the error, we can check that our error function works correctly. To do this, we simply obtain the error from each of the 3 centroids to a single observation and compare the result visually:

errors = np.array([])
for centroid in range(centroids.shape[0]):
    error = calculate_error(centroids.iloc[centroid, :2], data.iloc[0,:2])
    errors = np.append(errors, error)

errors
array([ 9239109.47028511,    76100.30291143, 15797406.01662303])

As we can see, the second centroid (index 1) is the one closest to our first observation. Let’s check it visually:

plt.scatter(data.iloc[1:,0], data.iloc[1:,1],  marker = 'o', alpha = 0.2)
plt.scatter(centroids.iloc[:,0], centroids.iloc[:,1],  marker = 'o', c = 'r')
plt.scatter(data.iloc[0,0], data.iloc[0,1],  marker = 'o', c = 'g')
for i in range(centroids.shape[0]):
    plt.text(centroids.iloc[i,0]+1, centroids.iloc[i,1]+1, s = centroids.index[i], c = 'r')
Errors from the centroids to the first observation

Indeed, the visual check confirms that the second centroid (index 1) is the closest one to the selected observation (in green). So our error function seems to be working fine.

So now that we know how to find the closest centroid, let’s continue programming the kmeans function in Python. Following the previous steps, for each of the observations, we must find the centroid that is closest to it.

Assign a centroid to each of the observations

Assigning a centroid to each observation is straightforward. We simply have to find the position of the minimum value in the error array, which we can do by combining the np.amin and np.where functions:

np.where(errors == np.amin(errors))[0].tolist()[0]
1
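As a side note, np.argmin returns the position of the minimum directly, so an equivalent and slightly more idiomatic way to write the same lookup would be:

int(np.argmin(errors))  # 1, the index of the closest centroid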

So we must apply this same process to all observations. Since this will be a recurring task, it is best to define it as a function.

def assign_centroid(data, centroids):
    '''
    Receives a dataframe of data and a dataframe of centroids and returns,
    for each observation, the closest centroid and the corresponding error.
    data: a dataframe with all the data that will be used.
    centroids: a dataframe with the centroids. The index is used for the assignment.
    '''

    n_observations = data.shape[0]
    centroid_assign = []
    centroid_errors = []
    k = centroids.shape[0]


    for observation in range(n_observations):

        # Calculate the error
        errors = np.array([])
        for centroid in range(k):
            error = calculate_error(centroids.iloc[centroid, :2], data.iloc[observation,:2])
            errors = np.append(errors, error)

        # Calculate closest centroid & error 
        closest_centroid =  np.where(errors == np.amin(errors))[0].tolist()[0]
        centroid_error = np.amin(errors)

        # Assign values to lists
        centroid_assign.append(closest_centroid)
        centroid_errors.append(centroid_error)

    return (centroid_assign,centroid_errors)

data['centroid'], data['error'] = assign_centroid(data.iloc[:,:2] ,centroids)
data[['centroid', 'error']].head()
   centroid         error
0         1  76100.302911
1         1   3810.746835
2         1  18034.697838
3         1  71229.148068
4         1  36703.174251

Now that we have made the assignment, we can check it visually:

colors = {0:'red', 1:'blue', 2:'green'}

plt.scatter(data.iloc[:,0], data.iloc[:,1],  marker = 'o', c = data['centroid'].apply(lambda x: colors[x]), alpha = 0.5)
plt.scatter(centroids.iloc[:,0], centroids.iloc[:,1],  marker = 'o', s=300, 
           c = centroids.index.map(lambda x: colors[x]))
Result of the first kmeans clustering

As we can see, each observation has been assigned to its closest centroid. Now that we know how to assign a centroid to each observation, let’s continue programming our kMeans algorithm in Python.

Calculate the sum of total errors

Calculating the sum of total errors is very simple. Since we have already stored each observation’s error in a column, simply adding up that column gives us the total error:

data['error'].sum()
7089622.406081449

We must compare this sum with the result of the previous iteration. In our case, this is the first iteration, so we will always continue.

Note that, in general, the stopping condition is that the error matches the previous iteration’s error exactly. That means the centroids have not moved, since they have reached a point that minimizes the error.

In any case, this is something we’ll see when we put it all together. For the moment, let’s continue programming our kMeans algorithm in Python and see how to recalculate the centroids!
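
To make this stopping condition concrete, here is a tiny illustration with a hypothetical error trace (the values are made up for the example):

# Hypothetical total errors recorded over four iterations
error = [7089622.406, 512345.871, 175816.147, 175816.147]

# Stop when the (rounded) total error matches the previous iteration's error
converged = len(error) >= 2 and round(error[-1], 3) == round(error[-2], 3)
print(converged)  # True: the centroids have stopped moving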

Recalculate the position of the centroids

If the stopping condition has not been met, that is, if the total error has decreased, we have to recalculate the position of the centroids and repeat the process. To recalculate each centroid, we simply compute the mean of each variable over the observations assigned to it.

Since the data frame data contains, for each observation, the centroid that has been assigned to it, this is something we can do very easily with a groupby:

data_columns = ['x','y']

centroids = data.groupby('centroid').agg('mean').loc[:,data_columns].reset_index(drop = True)
centroids
           x          y
0  26.831000   4.002000
1  10.902256  19.928974
2  48.898800  50.526050

In fact, if we re-visualize the data we can see how the centroids have changed position, placing themselves in the center of their data:

plt.scatter(data.iloc[:,0], data.iloc[:,1],  marker = 'o', c = data['centroid'].apply(lambda x: colors[x]), alpha = 0.5)
plt.scatter(centroids.iloc[:,0], centroids.iloc[:,1],  marker = 'o', s=300, 
           c = centroids.index.map(lambda x: colors[x]))
Recalculate the centroids

Now that we have recalculated the centroids, we just need to repeat the above process until the error no longer decreases. So, let’s see how to put everything together and finish creating our kMeans algorithm in Python.

Putting it all together to program our kmeans algorithm in Python

Finally, we simply have to put the whole process together inside a while loop, thus creating our kMeans algorithm in Python. Let’s get to it!

def kmeans(data, k):
    '''
    Given a dataset and a number of clusters, it clusters the data. 
    data: a DataFrame with all the necessary information
    k: number of clusters to create
    '''

    # Initialize centroids and error
    centroids = initialize_centroids(k, data)
    error = []
    compr = True
    i = 0

    while(compr):
        # Assign centroids and obtain the error of the iteration
        data['centroid'], iter_error = assign_centroid(data, centroids)
        error.append(sum(iter_error))
        # Recalculate centroids
        centroids = data.groupby('centroid').agg('mean').reset_index(drop = True)

        # Check if the error has stopped decreasing
        if(len(error) < 2):
            compr = True
        else:
            if(round(error[i], 3) != round(error[i-1], 3)):
                compr = True
            else:
                compr = False
        i = i + 1 

    # Final assignment with the converged centroids
    data['centroid'], iter_error = assign_centroid(data, centroids)
    centroids = data.groupby('centroid').agg('mean').reset_index(drop = True)
    return (data['centroid'], iter_error, centroids)

Now we can apply this function to our data. To do this, we first pass the data dropping the centroid and error columns that we created earlier.

data['centroid'], _, centroids = kmeans(data.drop(['centroid','error'], axis = 1), 3)
data['centroid'].head()                                   
0    0
1    0
2    0
3    0
4    0
Name: centroid, dtype: int64

Finally, we can check visually how the clustering has turned out:

plt.scatter(data.iloc[:,0], data.iloc[:,1],  marker = 'o', c = data['centroid'].apply(lambda x: colors[x]), alpha = 0.5)
plt.scatter(centroids.iloc[:,0], centroids.iloc[:,1],  marker = 'o', s=300, 
           c = centroids.index.map(lambda x: colors[x]))
Clustering with minimal error

As we can see, the algorithm has clustered the data and obtained the desired result… luckily. But this is not always the case. Let’s see what happened and how we can get better results from our kMeans algorithm programmed in Python!

How to improve the results of the kMeans algorithm programmed in Python

Get a better data classification

When the kMeans algorithm returns a result, it means it has reached a point that minimizes the error. However, that point does not have to be the global minimum, that is, the best possible solution; it can be a local minimum instead. This can be seen clearly in the following image:

Difference between a local minimum and a global minimum

This is due to the random initialization of the centroids. It may happen that the centroids are initialized in such a way that the data does not end up well classified, resulting in poor clustering.

Fortunately, mitigating the problem of the random initialization of the centroids is very simple: run the kMeans algorithm several times with different initializations. We measure the total error of each run, and the run with the lowest total error gives the best solution we have found.

Logically, this has a trade-off: it makes executing the kMeans algorithm somewhat slower than it otherwise would be.
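
Incidentally, this is the same strategy that scikit-learn applies by default: its KMeans estimator accepts an n_init parameter that controls how many times the algorithm is run with different centroid seeds, keeping the run with the lowest total squared error (its inertia_). A minimal sketch with our data:

from sklearn.cluster import KMeans

# Run k-means 10 times with different random initializations
# and keep the run with the lowest total squared error
km = KMeans(n_clusters = 3, n_init = 10, random_state = 0)
labels = km.fit_predict(data[['x', 'y']])
print(km.inertia_)  # total squared error of the best of the 10 runs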

In any case, let’s see how we can solve it in our case:

num_trials = 10

classifications = []
errors = []
centroids = []

for i in range(num_trials):

    np.random.seed(i)

    iter_class, iter_error, iter_centroid = kmeans(data.drop(['centroid','error'], axis = 1), 4)

    classifications.append(iter_class)
    errors.append(sum(iter_error))
    centroids.append(iter_centroid)

errors
[175816.14665013814,
 175816.14665013814,
 175816.14665013814,
 2997534.4064878933,
 2997534.4064878933,
 2997534.4064878933,
 175816.14665013814,
 175816.14665013814,
 2958942.2373577864,
 175816.14665013814]

If you notice, although several runs have given the same result (175816), others have given a suboptimal result (2997534, 2958942); that is, they have not classified the data well.

On the other hand, if we had run the algorithm only once, we could have been unlucky with the random initialization of the centroids and obtained a result that is not optimal.

However, by running the algorithm multiple times, we can find the run that minimizes the error. That will be the best clustering that the kMeans algorithm we just programmed has achieved.

So, let’s visualize it:

errors = np.array(errors)
best_ind = np.where(errors == errors.min())[0].tolist()[0]

data['centroid'] = classifications[best_ind]

plt.scatter(data.iloc[:,0], data.iloc[:,1],  marker = 'o', c = data['centroid'].apply(lambda x: colors[x]), alpha = 0.5)
plt.scatter(centroids[best_ind].iloc[:,0], centroids[best_ind].iloc[:,1],  marker = 'o', s=300, c = centroids[best_ind].index.map(lambda x: colors[x]))
Classification with minimum error

Wow… something weird happened. Although the error is indeed lower, it seems that a cluster has “disappeared”… This raises another important issue that we must take into account when using the kMeans algorithm: the fading of clusters.

Cluster fading

If the number of groups we ask the algorithm for is very different from what the algorithm really “finds”, it may happen that a cluster has no close points and, therefore, disappears.

This obviously should not be allowed: if the user asks for $ k = 3 $ clusters, we have to return a result with $ k = 3 $ clusters. However, a fading cluster can hide an underlying problem: the number of clusters may not be well chosen.

Let’s see an example of what I mean:

_, _, centroids = kmeans(data.drop(['centroid','error'], axis = 1), 5)
centroids
         x         y
0  19.8941   8.44635
1   4.8009  26.51980
2   0.6128  34.71070
3  48.8988  50.52605

As we can see, we asked the algorithm for 5 clusters, but it has only created 4… This is not something we want in our kMeans algorithm.

Thus, we must modify our algorithm so that:

  1. In case a centroid “vanishes”, it is re-initialized.
  2. A warning is issued whenever a centroid vanishes.

import warnings

def kmeans(data, k):
    '''
    Given a dataset and a number of clusters, it clusters the data. 
    data: a DataFrame with all the necessary information
    k: number of clusters to create
    '''

    # Initialize centroids
    centroids = initialize_centroids(k, data)
    error = []
    compr = True
    i = 0

    while(compr):
        # Assign centroids and obtain the error of the iteration
        data['centroid'], iter_error = assign_centroid(data, centroids)
        error = np.append(error, sum(iter_error))
        # Recalculate centroids
        centroids = data.groupby('centroid').agg('mean').reset_index(drop = True)

        # Re-initialize any centroid that has vanished
        if(centroids.shape[0] < k):
            warnings.warn("Cluster vanished! Consider reducing the number of k")
            number_centroids_reinitialize = k - centroids.shape[0] 
            reinitialized_centroids = initialize_centroids(number_centroids_reinitialize, data.drop(['centroid'], axis = 1))

            # Find the indexes of the centroids that are missing
            ind_missing = np.isin(np.array(range(k)), centroids.index)
            reinitialized_centroids.index = np.array(range(k))[ind_missing == False]

            # Include the new centroids, keeping the index order consistent
            centroids = pd.concat([centroids, reinitialized_centroids]).sort_index()

        # Check if the error has stopped decreasing
        if(len(error) < 2):
            compr = True
        else:
            if(round(error[i], 3) != round(error[i-1], 3)):
                compr = True
            else:
                compr = False
        i = i + 1 

    return (data['centroid'], error[-1], centroids)

Now we can check how the clustering behaves after this modification:

_, _, centroids = kmeans(data.drop(['centroid','error'], axis = 1), 5)
centroids

<ipython-input-19-68fc52047743>:25: UserWarning: Cluster vanished! Consider reducing the number of k
 warnings.warn("Cluster vanished! Consider reducing the number of k")    
           x          y
0  54.338400  46.538400
1   2.706850  30.615250
2  19.894100   8.446350
3  48.804500  56.169875
4  45.121143  46.924286

As we can see, in this case the function has raised a warning because a cluster vanished, but it has continued running and returned the 5 centroids we asked for.

Now we know a way to improve the results of the kMeans algorithm in Python, and we have seen one of the typical problems that arises when programming it (and that is easy to overlook).

However, there is another very important and very simple issue that can help us improve the results of our clustering: scaling the data.

Scaling the data

As we saw before, the kMeans algorithm is based on minimizing the sum of squared errors. In this process, it is important to take the scales of the variables into account. After all, if one variable has a much larger scale than another, it will influence the error much more, which can lead to incorrect classifications.

Let’s see an example to check this:

a = np.array([180,2])
b = np.array([220,2])
c = np.array([230,6])

plt.xlim([0,300])
plt.ylim([0,8])
plt.scatter(a[0], a[1],  marker = 'o', c = 'g', s=300)
plt.scatter(b[0], b[1],  marker = 'o', c = 'b', s=300)
plt.scatter(c[0], c[1],  marker = 'o', c = 'r', s=300)
Importance of normalizing data in kmeans algorithm

In this case we have two clusters (the red and green points) and an observation (the blue point). Visually, the green point has less error with respect to the blue point (it is closer) than the red point does.

However, if we calculate the errors, we see the following:

print('Error of green to blue: ' + str(calculate_error(a, b)))
print('Error of red to blue: ' + str(calculate_error(c, b))) 
Error of green to blue: 1600
Error of red to blue: 116

As we can see, the error from green to blue is much greater than from red to blue… Therefore our algorithm would assign the blue point to the red cluster, although visually it clearly should not. This is due to the scales: since the x-axis has a much larger scale, it contributes much more to the error than the y-axis.

To solve this, we simply have to normalize the data, that is, put all the variables on the same scale. This can be achieved by applying the following formula:

$$ x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}} $$

In this way, all the data will range from 0 to 1, so every variable follows the same scale. Let’s check it on the previous example:

def normalize(x):
    return (x - x.min()) / (x.max() - x.min())

temp_data = pd.DataFrame([a, b, c], columns = ['x', 'y'])
temp_data = temp_data.apply(normalize)

print('Error of green to blue (normalized data): ' + str(calculate_error(temp_data.loc[0], temp_data.loc[1])))
print('Error of red to blue (normalized data): ' + str(calculate_error(temp_data.loc[2], temp_data.loc[1])))
Error of green to blue (normalized data): 0.6400000000000001
Error of red to blue (normalized data): 1.04

As you can see, after normalizing the data the situation has changed: the cluster with the smallest error with respect to the blue point is now the green one, which matches what we see visually.

So, before applying the kMeans algorithm (or any other clustering algorithm), it is essential to normalize the data first. It is very simple to do, and its impact can be huge.
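
In our fictitious dataset both variables already have similar ranges, so normalization barely changes the result, but the pattern is straightforward with the functions we have already defined. A quick sketch:

# Normalize each column to the [0, 1] range before clustering
data_norm = data[['x', 'y']].apply(normalize)
data_norm['centroid'], _, centroids_norm = kmeans(data_norm, 3)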

Having said that, there is one last question that is important for using our kMeans algorithm in Python well: knowing how to choose the number of clusters.

How to choose the number of clusters with the kmeans algorithm in Python

So far, the number of clusters I have chosen has been completely subjective. However, when we face a real problem (such as user segmentation, for example), we do not know how many clusters there are. So how do we choose the number of clusters?

Choosing the number of clusters is quite simple once you understand two facts:

  • If you choose a very low number of clusters, the error will be high, since many observations will be far from their centroid.
  • If you choose a very high number of clusters, the error will be very low, reaching 0 when there are as many clusters as observations.

In other words, as the number of clusters increases, the total error of our algorithm decreases. However, it does not decrease uniformly: going from 1 cluster to 2 reduces the error much more than going from $ n-1 $ to $ n $.

So, if we plot the total error for different numbers of clusters, there will be a point after which the error keeps decreasing, but much more slowly. That is the number of clusters we should choose.

This way of choosing the number of clusters is known as the elbow method, since we look for the “elbow” of the graph.

Let’s see how it works with our example:

total_errors = []
n = 10

for i in range(n):
    _, it_error, _ = kmeans(data.drop(['centroid','error'], axis = 1), i+1) 
    total_errors.append(it_error)

plt.figure(figsize=(10,5))
plt.plot(range(1, n+1), total_errors)
Elbow rule to choose k in kmeans Python algorithm

As we can see, the elbow rule tells us that with our data we should choose between 2 and 3 clusters (I would choose 3), which is precisely the number of groups I created at the beginning.

As you can see, choosing the number of clusters is very simple, although, depending on how much data we have, it can take some time.

Conclusion

As you can see, the kMeans algorithm is very simple to understand. However, it has a number of issues (such as cluster fading) that are very easy to miss if you have not programmed it from scratch.

As always, I hope you found it interesting. If so, you can subscribe to keep up to date with the new posts I publish. In any case, see you next time!