How to code a neural network from scratch in Python

Neural networks are very powerful algorithms within the field of Machine Learning. On this post we have talked about them a lot, from coding them from scratch in R to using them to classify images with Keras. But how can I code a neural network from scratch in Python? I will explain it on this post. Sound exciting, right? So let’s get into it!

Neural networks are made of neurons. These neurons are grouped in layers: each neuron of each layer if connected with all the neurons from the previous layer. In each layer, a neuron undertakes a series of mathematical operations. When the parameters used on this operations are optimized, we make the neural network learn and that’s how we can get spectacular results.

That being said, if we want to code a neural network from scratch in Python we first have to code a neuron layer. So let’s do it!

Creating the neuron layers

In order to program a neuron layer first we need to fully understand what a neuron does. Basically a neuronal network works as follows:

  1. A layer receives inputs. On the first layer, the inputs will be the data itself and that is why it is called the input layer. On the rest of the layers, the input will be the result of the previous layer.
  2. The input values of the neuron are weighted by a matrix, called W (weights). This matrix has as many rows as neurons in the previous and as many columns as neurons in this layer.
  3. To the result from the weighted sum, another parameter is added, named bias (b). In this case, every neuron has each one bias, so for a given layer, there will be as many biases as neurons are in that layer.
  4. Finally, we will apply a function to the previous, knows as the activation function. You might have realized that until now we have only make linear transformations of the data. The activation function will ensure the non-linearity of the model. If we wouldn’t apply an activation function, any neural network would indeed be a linear regression. The result after applying the activation function will be the result of the neuron.

So, in order to create a neural network in Python from scratch, the first thing that we need to do is code neuron layers. To do that we will need two things: the number of neurons in the layer and the number of neurons in the previous layer.

So, we will create a class called capa which will return a layer if all its information: b, W, activation function, etc. Besides, as both b and W are parameters, we will initialize them. To do so, we will use trunconorm function from stats library, as it enables us to create random data give a mean and a standard deviation. It is good practice to initiate the values of the parameters with standarized values that is, with values with mean 0 and standard deviation of 1.

from scipy import stats

class capa():
  def __init__(self, n_neuronas_capa_anterior, n_neuronas, funcion_act):
    self.funcion_act = funcion_act
    self.b  = np.round(stats.truncnorm.rvs(-1, 1, loc=0, scale=1, size= n_neuronas).reshape(1,n_neuronas),3)
    self.W  = np.round(stats.truncnorm.rvs(-1, 1, loc=0, scale=1, size= n_neuronas * n_neuronas_capa_anterior).reshape(n_neuronas_capa_anterior,n_neuronas),3)

With this we have already defined the structure of a layer. Without any doubt, the definition of classes is much easier in Python than in R. That’s a good point for Python.

That being said, let’s see how activation functions work.

Activation functions

As explained before, to the result of adding the bias to the weighted sum, we apply an activation function. But, which function do we use?

In practice, we could apply any function that avoids non-linearity. However, there are some functions that are widely used. In our case we will use two functions: sigmoid function and Relu function.

Activation functions: Sigmoid Function

The sigmoid function takes a value x and returns a value between 0 and 1. That makes this function very interesting as it indicates the probability of a state to happen. For example, if we apply the sigmoid function as the activation function of the output layer in a classification problem, we will get the probability of belonging to a class.

Let’s see how the sigmoid function is coded:

import numpy as np
import math
import matplotlib.pyplot as plt


sigmoid = (
  lambda x:1 / (1 + np.exp(-x)),
  lambda x:x * (1 - x)
  )

rango = np.linspace(-10,10).reshape([50,1])
datos_sigmoide = sigmoid[0](rango)
datos_sigmoide_derivada = sigmoid[1](rango)

#We create the plot
fig, axes = plt.subplots(nrows=1, ncols=2, figsize =(15,5))
axes[0].plot(rango, datos_sigmoide)
axes[1].plot(rango, datos_sigmoide_derivada)
fig.tight_layout()

Activation functoin: ReLu function

The ReLu function it’s very simple: for negative values it returns zero, while for positive values it returns the input value. Despite being so simple, this function is one of the most (if not the most) used activation function in deep learning and neural network. The reason is that, despite being so simple it is very effective as it avoid gradient vanishing (more info here).

Let’s code the Relu activation function!

def derivada_relu(x):
  x[x<=0] = 0
  x[x>0] = 1
  return x

relu = (
  lambda x: x * (x > 0),
  lambda x:derivada_relu(x)
  )

datos_relu = relu[0](rango)
datos_relu_derivada = relu[1](rango)


# We set the range
rango = np.linspace(-10,10).reshape([50,1])

# We create the graphs
plt.cla()
fig, axes = plt.subplots(nrows=1, ncols=2, figsize =(15,5))
axes[0].plot(rango, datos_relu[:,0])
axes[1].plot(rango, datos_relu_derivada[:,0])
plt.show()
Result of the Relu activation functions and it's derivative of our tutorial on how to code a neural network from scratch

We now have coded both neuron layers and activation functions. So let’s see how to code the rest of our neural network in Python!

How to code a neural network in Python from scratch

In order to create a neural network we simply need three things: the number of layers, the number of neurons in each layer, and the activation function to be used in each layer. With these and what we have built until now, we can create the structure of our neural network.

In our case, we will use the neural network to solve a classification problem with two classes. To do so we will create a small neural network with 4 layers, that will have the following:

  • An input layer with two neurons, as we will use two variables.
  • Two hidden layers with 4 and 8 neurons respectively.
  • One output layer with just one neuron.

It is a quite complex network for such shilly problem, but it is just for you to see how everything works more easily.

Besides, we also have to define the activation function that we will use in each layer. Generally all neurons within a layer use the same activation function. In this case I will use Relu activation function in all hidden layers and sigmoid activation function in the output layer.

To do so, we have to bear in mind that Python does not allow us to create a list of functions. So, that is why we have created relu and sigmoid functions as a pair of hidden functions using lambda.

# Number of neurons in each layer 
# The first value is equal to the number of predictors.
neuronas = [2,4,8,1] 

# Activation functions to use in each layer. 
funciones_activacion = [relu,relu, sigmoid]

Now we can build the structure of our neural network. We will do that iteratively and will store all the results on the object red_neuronal.

red_neuronal = []

for paso in range(len(neuronas)-1):
  x = capa(neuronas[paso],neuronas[paso+1],funciones_activacion[paso])
  red_neuronal.append(x)

print(red_neuronal)
[<__main__.capa object at 0x000001B7A673D550>, <__main__.capa object at 0x000001B7A673DFD0>, <__main__.capa object at 0x000001B7A673D860>]

We have just created the structure of our neural network! Now we just have to code two things more. On the one hand we have to connect the whole network so that it throws us a prediction. Besides, we have to make the network learn by calculating, propagating and optimizing the error.

Making our neural network predict

In order to make our neural network predict we just need to define the calculus that it needs to make. As I have previously mentioned, there are three calculation it has to undertake: a weighted multiplication with W, adding b and applying the activation function.

In order to multiply the input values of the neuron with W we will use matrix multiplication. Let’s see the example on the first layer:

X =  np.round(np.random.randn(20,2),3) # Example of an input vector

z = X @ red_neuronal[0].W

print(z[:10,:], X.shape, z.shape)
[[ 1.191768 -1.039061  1.0627   -0.57164 ]
 [ 0.341264  0.412747  1.27404  -0.336425]
 [ 0.218592 -0.431019 -0.133344 -0.046377]
 [ 0.24092  -0.24311   0.169692 -0.107519]
 [-1.104672  1.410984 -0.373584  0.420948]
 [-0.143616 -0.170053 -0.531184  0.140693]
 [ 0.49172  -0.532545  0.296708 -0.210606]
 [-0.779768  0.724851 -0.633884  0.363078]
 [-0.006408 -0.161554 -0.233908  0.043721]
 [ 0.469408 -0.394901  0.438176 -0.228647]] (20, 2) (20, 4)

Now we just have to add the bias parameter to z.

z = z + red_neuronal[0].b

print(z[:5,:])
[[ 1.878768e+00 -7.130610e-01  1.195700e+00 -7.946400e-01]
 [ 1.028264e+00  7.387470e-01  1.407040e+00 -5.594250e-01]
 [ 9.055920e-01 -1.050190e-01 -3.440000e-04 -2.693770e-01]
 [ 9.279200e-01  8.289000e-02  3.026920e-01 -3.305190e-01]
 [-4.176720e-01  1.736984e+00 -2.405840e-01  1.979480e-01]]

Now we have to apply the activation function of this layer.

a = red_neuronal[0].funcion_act[0](z)
a[:5,:]

Out[8]:

array([[ 1.878768, -0.      ,  1.1957  , -0.      ],
       [ 1.028264,  0.738747,  1.40704 , -0.      ],
       [ 0.905592, -0.      , -0.      , -0.      ],
       [ 0.92792 ,  0.08289 ,  0.302692, -0.      ],
       [-0.      ,  1.736984, -0.      ,  0.197948]])

With that we have the result of the first layer, that will be the input for the second layer. So, this is a process that can clearly get done on a for loop:

output = [X]

for num_capa in range(len(red_neuronal)):
  z = output[-1] @ red_neuronal[num_capa].W + red_neuronal[num_capa].b
  a = red_neuronal[num_capa].funcion_act[0](z)
  output.append(a)

print(output[-1])
[[0.61015892]
 [0.45732425]
 [0.55470963]
 [0.5508974 ]
 [0.38760538]
 [0.49644835]
 [0.57029934]
 [0.41200621]
 [0.51346008]
 [0.56647143]
 [0.42297515]
 [0.42578556]
 [0.41394167]
 [0.45673122]
 [0.50093812]
 [0.41234362]
 [0.5878003 ]
 [0.43970666]
 [0.59577249]
 [0.58614669]]

We have just make our neural network predict! Awesome, right? As it is the first round, the network has not trained yet. That is why the results are so poor. So, in order to entirely code our neural network from scratch in Python we just have one thing left: to train our neural network. Let’s do it!

Training our neural network

Creating the cost function

In order to train or improve our neural network we first need to know how much it has missed. To do so we will use a very typical cost function, that, despite not being the best for binary classification, will still do the trick: the Mean Square Error (MSE).

The MSE is quite simple to calculate: you subtract the real value from every prediction, square it, and calculate its square root. Besides, we will also calculate the derivative of the cost function as it will be useful for backpropagation:

def mse(Ypredich, Yreal):

  # Calculamos el error
  x = (np.array(Ypredich) - np.array(Yreal)) ** 2
  x = np.mean(x)

  # Calculamos la derivada de la funcion
  y = np.array(Ypredich) - np.array(Yreal)
  return (x,y)

With this, we will make up some labels for the predictions that we have get before, so that we can calculate the cost function.

from random import shuffle

Y = [0] * 10 + [1] * 10
shuffle(Y)
Y = np.array(Y).reshape(len(Y),1)

mse(output[-1], Y)[0]

Out[11]:

0.27785863420024487

Now that we have calculated the error we have to move it backwards so that we can know how much error has each neuron make. Afterwards we will use that error to optimize the parameters. By doing so is how the neural network trains.

Backpropagation and gradient descent: training our neural network

Gradient descent: optimizing the parameters

With gradient descent we will optimize the parameters. If you remember, when we have created the structure of the network, we have initialize the parameters with random value. Obviously those values are not the optimal ones, so it is very unlikely that the network will perform well at the beginning.

To see it more visually, let’s imagine that the parameters have been initialized in this position:

As you can see, the values are far from their optimum positions (the blue ones at the bottom). We need to make our parameters go there, but how do we do that?

To do so we will use gradient descent. Gradient descent takes the error at one point and calculates the partial derivatives at that point. By doing so we calculate the gradient vector, that is, a vector that points the direction where the error increases. So, if we take the reverse value of the gradient vector, we will go deeper in the graph. In summary, gradient descent calculates the reverse of the gradient to improve the hyperparameters.

How deeper we will move on the graph will depend on another hyperparameter: the learning rate. Despite this hyperparameter is not optimized, there are two things to bear in mind:

  • If the learning rate is too low it will take a long time for the algorithm to learn because each step will be very small.
  • If the learning rate is too high you might give too big steps so that you never reach to the optimal value.

In order to avoid this some techniques can be applied, such as learning rate decade. In our case, we will not make it more difficult than it already is, so we will use a fixed learning rate.

With gradient descent, at each step, the parameters will move towards their optimal value, until they reach a point where they do not move anymore. At that point we can say that the neural network is optimized.

Proceso de optimización mediante gradient descent. Optimizing process with gradient descent

This sounds cool. It sounds easy to calculate on the output layer, as we can easily calculate the error there, but what happens with other layers? For that we use backpropagation:

Backpropagation: calculating the error in each layer

When making a prediction, all layers will have an impact on the prediction: if we have a big error on the first layer it will affect the performance of the second layer, the error of the second will affect the third layer, etc.

So, the only way to calculate error of each layer is to do it the other way around: we calculate the error on the last layer. With that we calculate the error on the previous layer and so on.

Besides, this is a very efficient process because we can use this back propagation to adjust the parameters W and b using gradient descent.

That being said, let’s see how gradient descent and backpropagation work. To do so, we will check the values of W and b on the last layer:

red_neuronal[-1].b
red_neuronal[-1].W
array([[ 0.583],
       [-0.692],
       [-0.15 ],
       [-0.69 ],
       [-0.547],
       [-0.316],
       [-0.581],
       [ 0.369]])

As we have initialized this parameters randomly, their values are not the optimal ones. Thus, in every step the parameters will continuosly change. To do so, we first have to move the error backwards.

The error is calculated as the derivative of the lost function multiplied by the derivative of the activation function. In our case, the result is stored on the layer -1, while the value that we want to optimize is on the layer before that (-2). Moreover, as we have defined the activation functions as a pair of functions, we just need to indicate the index 1 to get the derivative.

# Backprop en la ultima capa
a = output[-1]
x = mse(a,Y)[1] * red_neuronal[-2].funcion_act[1](a)

x
array([[-0.38984108],
       [-0.54267575],
       [ 0.55470963],
       [ 0.5508974 ],
       [ 0.38760538],
       [ 0.49644835],
       [ 0.57029934],
       [-0.58799379],
       [-0.48653992],
       [ 0.56647143],
       [-0.57702485],
       [-0.57421444],
       [-0.58605833],
       [ 0.45673122],
       [-0.49906188],
       [-0.58765638],
       [ 0.5878003 ],
       [ 0.43970666],
       [ 0.59577249],
       [-0.41385331]])

If we did this on every layer we would propagate the error generated by the neural network. However, just calculating the error is useless. Now we need to use that error to optimize the parameters with gradient descent. To do so, we need to calculate the derivatives of b and W and subtract that value from the previous b and W.

red_neuronal[-1].b = red_neuronal[-1].b - x.mean() * 0.01
red_neuronal[-1].W = red_neuronal[-1].W - (output[-1].T @ x) * 0.01

red_neuronal[-1].b
red_neuronal[-1].W
array([[ 0.58338478],
       [-0.69161522],
       [-0.14961522],
       [-0.68961522],
       [-0.54661522],
       [-0.31561522],
       [-0.58061522],
       [ 0.36938478]])

With this we have just optimized a little bit W and b on the last layer. If we want to calculate the error on the previous layer we have to undertake a matrix multiplication of this layers error and its weights (W). But, we have just uploaded the values of W, so how do we do that?

In order to solve that problem we need to create some object that stores the values of W before it is optimized. In my case I have named this object as W_temp. By doing this, we are able to calculate the error corresponding to each neuron and optimize the values of the parameters all at the same time.

If we put everything together, the formula of backpropagation and gradient descent is as follows:

# We define the learning rate
lr = 0.05

# We create the inverted index
back = list(range(len(output)-1))
back.reverse()

# We create a vector where we will store the errors of each layer
delta = []

for capa in back:
  # Backprop #

  # We store the result of the last layer before using gradient descent
  a = output[capa+1][1]

  # Backprop on the last layer
  if capa == back[0]:
    x = mse(a,Y)[1] * red_neuronal[capa].funcion_act[1](a)
    delta.append(x)

  # Backprop on the rest of the layers
  else:
    x = delta[-1] @ W_temp * red_neuronal[capa].funcion_act[1](a)
    delta.append(x)

  # We store the values of W in order to use them on the next iteration
  W_temp = red_neuronal[capa].W.transpose()

  # Gradient Descent #

  # We adjust the values of the parameters
  red_neuronal[capa].b = red_neuronal[capa].b - delta[-1].mean() * lr
  red_neuronal[capa].W = red_neuronal[capa].W - (output[capa].T @ delta[-1]) * lr


print('MSE: ' + str(mse(output[-1],Y)[0]) )
print('Estimacion: ' + str(output[-1]) )
MSE: 0.5
Estimacion: [[1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]]

With this we have just applied backpropagation and gradient descent. You have learned how to code a neural network from scratch in Python! That is awesome! What about testing our neural network on a problem? Let’s do it!

Example : trying our neural network

Defining the problem: classifing points

We will test our neural network with quite an easy task. to classify between two types of points. To do so, we first need to create a function that returns numbers around an imaginary circle with radius of R.

import random

def circulo(num_datos = 100,R = 1, minimo = 0,maximo= 1):
  pi = math.pi
  r = R * np.sqrt(stats.truncnorm.rvs(minimo, maximo, size= num_datos)) * 10
  theta = stats.truncnorm.rvs(minimo, maximo, size= num_datos) * 2 * pi *10

  x = np.cos(theta) * r
  y = np.sin(theta) * r

  y = y.reshape((num_datos,1))
  x = x.reshape((num_datos,1))

  #We reduce the number of elements so that there is no overflow
  x = np.round(x,3)
  y = np.round(y,3)

  df = np.column_stack([x,y])
  return(df)

We now create two sets of random data, each with 150 data points. Besides it sets of data will have different radius. As the results might overflow a little, it will not be easy for our neural network to get them all right.

datos_1 = circulo(num_datos = 150, R = 2)
datos_2 = circulo(num_datos = 150, R = 0.5)
X = np.concatenate([datos_1,datos_2])
X = np.round(X,3)

Y = [0] * 150 + [1] * 150
Y = np.array(Y).reshape(len(Y),1)

We just have created our both training and testing input data. Let’s visualize the problem that our neural network will have to face:

plt.cla()
plt.scatter(X[0:150,0],X[0:150,1], c = "b")
plt.scatter(X[150:300,0],X[150:300,1], c = "r")
plt.show()
Classification problem for a neural network

Training our neural network

The first thing is to convert the code that we have created above into functions. This will help us a lot.

def entrenamiento(X,Y, red_neuronal, lr = 0.01):

  output = [X]

  for num_capa in range(len(red_neuronal)):
    z = output[-1] @ red_neuronal[num_capa].W + red_neuronal[num_capa].b

    a = red_neuronal[num_capa].funcion_act[0](z)

    output.append(a)

  # Backpropagation

  back = list(range(len(output)-1))
  back.reverse()


  delta = []

  for capa in back:
    # Backprop #delta

    a = output[capa+1]

    if capa == back[0]:
      x = mse(a,Y)[1] * red_neuronal[capa].funcion_act[1](a)
      delta.append(x)

    else:
      x = delta[-1] @ W_temp * red_neuronal[capa].funcion_act[1](a)
      delta.append(x)

    W_temp = red_neuronal[capa].W.transpose()

    # Gradient Descent #
    red_neuronal[capa].b = red_neuronal[capa].b - np.mean(delta[-1], axis = 0, keepdims = True) * lr
    red_neuronal[capa].W = red_neuronal[capa].W - output[capa].transpose() @ delta[-1] * lr

  return output[-1]

We have the training function! Before checking the performance I will reinitialize some objects. By doing so we ensure that nothing of what we have done before will affect:

class capa():
  def __init__(self, n_neuronas_capa_anterior, n_neuronas, funcion_act):
    self.funcion_act = funcion_act
    self.b  = np.round(stats.truncnorm.rvs(-1, 1, loc=0, scale=1, size= n_neuronas).reshape(1,n_neuronas),3)
    self.W  = np.round(stats.truncnorm.rvs(-1, 1, loc=0, scale=1, size= n_neuronas * n_neuronas_capa_anterior).reshape(n_neuronas_capa_anterior,n_neuronas),3)

neuronas = [2,4,8,1] 
funciones_activacion = [relu,relu, sigmoid]
red_neuronal = []

for paso in list(range(len(neuronas)-1)):
  x = capa(neuronas[paso],neuronas[paso+1],funciones_activacion[paso])
  red_neuronal.append(x)    

We have the network ready! We will simply store the results so that we can see how our network is training:

error = []
predicciones = []

for epoch in range(0,1000):
  ronda = entrenamiento(X = X ,Y = Y ,red_neuronal = red_neuronal, lr = 0.001)
  predicciones.append(ronda)
  temp = mse(np.round(predicciones[-1]),Y)[0]
  error.append(temp)

There is no error, so it looks like everything has gone right. Now let’s see how it has improve:

epoch = list(range(0,1000))
plt.plot(epoch, error)
[<matplotlib.lines.Line2D at 0x1b7a6a9a9e8>]
Evolution of the loss of a neural network coded from scratch in Python

Our neural network has trained! In fact, it has gone from an error of 0.5 (completely random) to just an error of 0.12 on the last epoch.

As we can see from epoch 900 on the network has not improve its performce. This is because the parameters were already optimized, so it could not improve more.

Conclusion

Regardless of whether you are an R or Python user, it is very unlikely that you are ever required to code a neural network from scratch, as we have done in Python. Most certainly you will use frameworks like Tensorflow, Keras or Pytorch.

Anyway, knowing how to code a neural network from scratch requieres you to strengthen your knowledge on neural networks, which is great to ensure that you deeply understand what you are doing and getting when using the frameworks stated above.

So, regardless of the language you use, I would deeply recommed you to code a neural network from scratch. It will take you a lot of time for sue. You will have setbacks. I have have them too (with classes in R and matrixes in Python) but despite that it is worth it all the way.

As always, I hope you have enjoyed the content. For any doubts, do not hesitate to contact me on Linkedin and see you on the next one!