Livio / August 11, 2019 / Python

Neural Networks

Neural networks are computer systems inspired by the human brain, which can 'learn things' by looking at examples. They can be used for tasks like image recognition, where we want our model to classify images of animals, for example. The main focus of this post is on building a class in Python that can do just that: at the end of the post we will use the class that we have built to make our computer recognize digits by looking at pictures.

Model Representation

Neural networks are usually represented as below:

The input layer corresponds to our training data. If, for instance, we're trying to classify images, this would be an m × n matrix, where m is the number of images and n is the number of pixels of each image, to which we add a bias term (a column of 1's), as we do in Logistic Regression.

$X=a^{(1)}=\begin{bmatrix} X_{1,0} & X_{1,1} & X_{1,2} & ... & X_{1,n}\\ X_{2,0} & X_{2,1} & X_{2,2} & ... & X_{2,n}\\ ... & ... & ... & ... & ... \\ ... & ... & ... & ... & ... \\ X_{m,0} & X_{m,1} & X_{m,2} & ... & X_{m,n}\\ \end{bmatrix}$

The output layer is how our training data should be classified. For instance, if we're classifying digits from 0 to 9, this would be an m × 10 matrix (10 digits), where each column represents the probability of each image belonging to that category.

The hidden layers can be thought of as neurons which get switched on and off based on the activation function. They capture more and more complexity with every layer we add. They are the magic of neural networks and provide the discrimination necessary to separate your training data. You can increase the number of neurons in a particular hidden layer, increase the number of hidden layers, or both. Increasing the number of neurons will allow you to decrease your training error, but it also reduces the amount of generalization, which can be very important depending on your problem. This balance is something you learn to manage the more times you do it.

There are different activation functions that can be used; in this post we will use the logistic function, which will look familiar:

$L(x)=\frac{1}{1+e^{-x}}$

Forward Propagation

The activation of the ‘neurons’ in our hidden layers is done by an algorithm called forward propagation which takes us from our input layer to our output layer. This is how it works visually:

Imagine you have a neural network consisting of an input layer of 3 features, one hidden layer of 5 neurons, another hidden layer of 4 neurons, and an output layer of 3 classes. The network would be:

Our input layer will be an m × 4 matrix, where m is the number of observations and 4 is the number of features including the bias term (3 + 1):

$a^{(1)}= \begin{bmatrix} x_{1,0} & x_{1,1} & x_{1,2} & x_{1,3} \\ x_{2,0} & x_{2,1} & x_{2,2} & x_{2,3} \\ .. & .. & .. & ..\\ .. & .. & .. & ..\\ x_{m,0} & x_{m,1} & x_{m,2} & x_{m,3} \\ \end{bmatrix} \in \mathbb{R}^{m \times 4}$

The step from the input layer to the first hidden layer is done by multiplying the input layer by our first thetas matrix, which is the first set of parameters our model will need to ‘learn’ in order to minimize the cost function I will show later:

$\theta^{(1)}= \begin{bmatrix} \theta_{1,1} & \theta_{1,2} & \theta_{1,3} & \theta_{1,4} & \theta_{1,5}\\ \theta_{2,1} & \theta_{2,2} & \theta_{2,3} & \theta_{2,4} & \theta_{2,5}\\ \theta_{3,1} & \theta_{3,2} & \theta_{3,3} & \theta_{3,4} & \theta_{3,5}\\ \theta_{4,1} & \theta_{4,2} & \theta_{4,3} & \theta_{4,4} & \theta_{4,5}\\ \end{bmatrix} \in \mathbb{R}^{4 \times 5}$

The multiplication gives us:

$a^{(1)} \theta^{(1)} = z^{(2)}= \begin{bmatrix} z_{1,1} & z_{1,2} & z_{1,3} & z_{1,4} & z_{1,5}\\ z_{2,1} & z_{2,2} & z_{2,3} & z_{2,4} & z_{2,5}\\ .. & .. & .. & .. & ..\\ z_{m,1} & z_{m,2} & z_{m,3} & z_{m,4} & z_{m,5}\\ \end{bmatrix} \in \mathbb{R}^{m \times 5}$

on which we will calculate the logistic function and add a bias term, therefore our first hidden layer is equal to:

$a^{(2)}= \begin{bmatrix} 1 & L(z_{1,1}) & L(z_{1,2}) & L(z_{1,3}) & L(z_{1,4}) & L(z_{1,5}) \\ 1 & L(z_{2,1}) & L(z_{2,2}) & L(z_{2,3}) & L(z_{2,4}) & L(z_{2,5})\\ 1 & .. & .. & .. & .. & ..\\ 1 & L(z_{m,1}) & L(z_{m,2}) & L(z_{m,3}) & L(z_{m,4}) & L(z_{m,5})\\ \end{bmatrix} = \begin{bmatrix} a_{1,0} & a_{1,1} & a_{1,2} & a_{1,3} & a_{1,4} & a_{1,5} \\ a_{2,0} & a_{2,1} & a_{2,2} & a_{2,3} & a_{2,4} & a_{2,5} \\ .. & .. & .. & .. & .. & ..\\ a_{m,0} & a_{m,1} & a_{m,2} & a_{m,3} & a_{m,4} & a_{m,5} \\ \end{bmatrix} \in \mathbb{R}^{m \times 6}$

The step from the first hidden layer to the second hidden layer is done by multiplying the first hidden layer by our second thetas matrix:

$\theta^{(2)}= \begin{bmatrix} \theta_{1,1} & \theta_{1,2} & \theta_{1,3} & \theta_{1,4}\\ \theta_{2,1} & \theta_{2,2} & \theta_{2,3} & \theta_{2,4}\\ \theta_{3,1} & \theta_{3,2} & \theta_{3,3} & \theta_{3,4}\\ \theta_{4,1} & \theta_{4,2} & \theta_{4,3} & \theta_{4,4}\\ \theta_{5,1} & \theta_{5,2} & \theta_{5,3} & \theta_{5,4}\\ \theta_{6,1} & \theta_{6,2} & \theta_{6,3} & \theta_{6,4}\\ \end{bmatrix} \in \mathbb{R}^{6 \times 4}$

The multiplication gives us:

$a^{(2)} \theta^{(2)} = z^{(3)}= \begin{bmatrix} z_{1,1} & z_{1,2} & z_{1,3} & z_{1,4}\\ z_{2,1} & z_{2,2} & z_{2,3} & z_{2,4}\\ .. & .. & .. & ..\\ z_{m,1} & z_{m,2} & z_{m,3} & z_{m,4}\\ \end{bmatrix} \in \mathbb{R}^{m \times 4}$

on which we will calculate the logistic function and add a bias term, therefore our second hidden layer is equal to:

$a^{(3)}= \begin{bmatrix} 1 & L(z_{1,1}) & L(z_{1,2}) & L(z_{1,3}) & L(z_{1,4}) \\ 1 & L(z_{2,1}) & L(z_{2,2}) & L(z_{2,3}) & L(z_{2,4})\\ 1 & .. & .. & .. & ..\\ 1 & L(z_{m,1}) & L(z_{m,2}) & L(z_{m,3}) & L(z_{m,4})\\ \end{bmatrix} = \begin{bmatrix} a_{1,0} & a_{1,1} & a_{1,2} & a_{1,3} & a_{1,4}\\ a_{2,0} & a_{2,1} & a_{2,2} & a_{2,3} & a_{2,4}\\ .. & .. & .. & .. & ..\\ a_{m,0} & a_{m,1} & a_{m,2} & a_{m,3} & a_{m,4}\\ \end{bmatrix} \in \mathbb{R}^{m \times 5}$

The step from the second hidden layer to the output layer is done by multiplying the second hidden layer by our third thetas matrix:

$\theta^{(3)}= \begin{bmatrix} \theta_{1,1} & \theta_{1,2} & \theta_{1,3}\\ \theta_{2,1} & \theta_{2,2} & \theta_{2,3}\\ \theta_{3,1} & \theta_{3,2} & \theta_{3,3}\\ \theta_{4,1} & \theta_{4,2} & \theta_{4,3}\\ \theta_{5,1} & \theta_{5,2} & \theta_{5,3}\\ \end{bmatrix} \in \mathbb{R}^{5 \times 3}$

The multiplication gives us:

$a^{(3)} \theta^{(3)} = z^{(4)}= \begin{bmatrix} z_{1,1} & z_{1,2} & z_{1,3}\\ z_{2,1} & z_{2,2} & z_{2,3}\\ .. & .. & ..\\ z_{m,1} & z_{m,2} & z_{m,3}\\ \end{bmatrix} \in \mathbb{R}^{m \times 3}$

on which we will calculate the logistic function:

$a^{(4)}= \begin{bmatrix} L(z_{1,1}) & L(z_{1,2}) & L(z_{1,3})\\ L(z_{2,1}) & L(z_{2,2}) & L(z_{2,3})\\ .. & .. & ..\\ L(z_{m,1}) & L(z_{m,2}) & L(z_{m,3})\\ \end{bmatrix} = \begin{bmatrix} a_{1,1} & a_{1,2} & a_{1,3}\\ a_{2,1} & a_{2,2} & a_{2,3}\\ .. & .. & ..\\ a_{m,1} & a_{m,2} & a_{m,3}\\ \end{bmatrix} \in \mathbb{R}^{m \times 3}$

and so we have all our layers: $a^{(1)}, a^{(2)}, a^{(3)}, a^{(4)}$
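The dimension bookkeeping above can be checked with a short NumPy sketch for the 3-5-4-3 network (the theta values here are random placeholders, only the shapes matter):

```python
import numpy as np

def sigmoid(z):
    # logistic function L(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m = 5                                  # number of observations
X = rng.random((m, 3))                 # 3 input features

# one theta matrix per step: (size of layer + 1 bias) x (size of next layer)
theta1 = rng.random((4, 5))
theta2 = rng.random((6, 4))
theta3 = rng.random((5, 3))

a1 = np.hstack([np.ones((m, 1)), X])                      # m x 4, bias added
a2 = np.hstack([np.ones((m, 1)), sigmoid(a1 @ theta1)])   # m x 6
a3 = np.hstack([np.ones((m, 1)), sigmoid(a2 @ theta2)])   # m x 5
a4 = sigmoid(a3 @ theta3)                                 # m x 3, no bias on output
```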

Cost function

As with Logistic Regression, we are facing an optimization problem. Keeping as an example the neural network shown above, we need to find the thetas which minimize our cost function. The thetas are contained within the three thetas matrices shown above:

$\theta^{(1)}= \begin{bmatrix} \theta_{1,1} & \theta_{1,2} & \theta_{1,3} & \theta_{1,4} & \theta_{1,5}\\ \theta_{2,1} & \theta_{2,2} & \theta_{2,3} & \theta_{2,4} & \theta_{2,5}\\ \theta_{3,1} & \theta_{3,2} & \theta_{3,3} & \theta_{3,4} & \theta_{3,5}\\ \theta_{4,1} & \theta_{4,2} & \theta_{4,3} & \theta_{4,4} & \theta_{4,5}\\ \end{bmatrix} \in \mathbb{R}^{4 \times 5}$

$\theta^{(2)}= \begin{bmatrix} \theta_{1,1} & \theta_{1,2} & \theta_{1,3} & \theta_{1,4}\\ \theta_{2,1} & \theta_{2,2} & \theta_{2,3} & \theta_{2,4}\\ \theta_{3,1} & \theta_{3,2} & \theta_{3,3} & \theta_{3,4}\\ \theta_{4,1} & \theta_{4,2} & \theta_{4,3} & \theta_{4,4}\\ \theta_{5,1} & \theta_{5,2} & \theta_{5,3} & \theta_{5,4}\\ \theta_{6,1} & \theta_{6,2} & \theta_{6,3} & \theta_{6,4}\\ \end{bmatrix} \in \mathbb{R}^{6 \times 4}$

$\theta^{(3)}= \begin{bmatrix} \theta_{1,1} & \theta_{1,2} & \theta_{1,3}\\ \theta_{2,1} & \theta_{2,2} & \theta_{2,3}\\ \theta_{3,1} & \theta_{3,2} & \theta_{3,3}\\ \theta_{4,1} & \theta_{4,2} & \theta_{4,3}\\ \theta_{5,1} & \theta_{5,2} & \theta_{5,3}\\ \end{bmatrix} \in \mathbb{R}^{5 \times 3}$

It is important to notice that the first row of each theta matrix corresponds to the bias terms (it is multiplied against the bias column of its corresponding layer); this is an observation we will need to keep in mind as we move along.

The cost function to minimize is:

$Cost(\theta)= -\frac{1}{m} \left [\sum_{i=1}^{m} \sum_{k=1}^{p} y_{k}^{(i)}\cdot \log(p_{k}^{(i)}) + (1-y_{k}^{(i)})\cdot \log(1-p_{k}^{(i)}) \right ] + \frac{\lambda}{2m} \sum_{h=1}^{L-1} \sum_{j=2}^{s_{h}+1} \sum_{i=1}^{s_{h+1}} (\theta_{j,i}^{(h)})^{2}$

Let's look first at the part before the plus sign.

This part is similar to that of logistic regression; the only difference is that now we're taking into account the errors on each class of Y, which is represented by the sum from k = 1 to p, where p is the number of classes in the output layer. If we were predicting just one class, like alive/dead or sick/not sick, it would be exactly the same as in logistic regression. Imagine our training data is made up of just five rows and we have a 3-class classification problem; then our Y matrix may look like this:

$Y = \begin{bmatrix} 1 & 0 & 0\\ 1 & 0 & 0\\ 0 & 0 & 1\\ 0 & 1 & 0\\ 0 & 0 & 1 \end{bmatrix}$

If the first column were cat, the second column bird and the third column dog, this would mean that in our training data the first image is of a cat, the second image is of a cat, the third image is of a dog, the fourth image is of a bird and the fifth image is of a dog.

Behind the scenes, the first part of the cost function is computing (notice the element-wise multiplication):

$\begin{bmatrix} 1 & 0 & 0\\ 1 & 0 & 0\\ 0 & 0 & 1\\ 0 & 1 & 0\\ 0 & 0 & 1 \end{bmatrix} \odot \begin{bmatrix} \log(a_{1,1}^{(4)}) & \log(a_{1,2}^{(4)}) & \log(a_{1,3}^{(4)})\\ \log(a_{2,1}^{(4)}) & \log(a_{2,2}^{(4)}) & \log(a_{2,3}^{(4)})\\ \log(a_{3,1}^{(4)}) & \log(a_{3,2}^{(4)}) & \log(a_{3,3}^{(4)})\\ \log(a_{4,1}^{(4)}) & \log(a_{4,2}^{(4)}) & \log(a_{4,3}^{(4)})\\ \log(a_{5,1}^{(4)}) & \log(a_{5,2}^{(4)}) & \log(a_{5,3}^{(4)})\\ \end{bmatrix}$

+

$\begin{bmatrix} 1-1 & 1-0 & 1-0\\ 1-1 & 1-0 & 1-0\\ 1-0 & 1-0 & 1-1\\ 1-0 & 1-1 & 1-0\\ 1-0 & 1-0 & 1-1 \end{bmatrix} \odot \begin{bmatrix} \log(1-a_{1,1}^{(4)}) & \log(1-a_{1,2}^{(4)}) & \log(1-a_{1,3}^{(4)})\\ \log(1-a_{2,1}^{(4)}) & \log(1-a_{2,2}^{(4)}) & \log(1-a_{2,3}^{(4)})\\ \log(1-a_{3,1}^{(4)}) & \log(1-a_{3,2}^{(4)}) & \log(1-a_{3,3}^{(4)})\\ \log(1-a_{4,1}^{(4)}) & \log(1-a_{4,2}^{(4)}) & \log(1-a_{4,3}^{(4)})\\ \log(1-a_{5,1}^{(4)}) & \log(1-a_{5,2}^{(4)}) & \log(1-a_{5,3}^{(4)})\\ \end{bmatrix}$

This results in a 5×3 matrix; we sum all its elements and multiply by -1/m, where m is the number of samples (5).
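This element-wise computation can be reproduced directly. As an illustration, here the output-layer activations $a^{(4)}$ are a placeholder matrix of 0.5s (which is what untrained, all-zero thetas would produce), so every entry contributes log(1/2) and the unregularized cost comes out to 3·log(2):

```python
import numpy as np

Y = np.array([[1, 0, 0],
              [1, 0, 0],
              [0, 0, 1],
              [0, 1, 0],
              [0, 0, 1]])
A = np.full((5, 3), 0.5)   # placeholder output-layer activations a(4)

m = Y.shape[0]
# element-wise products, summed, times -1/m
unregularized_cost = -(Y * np.log(A) + (1 - Y) * np.log(1 - A)).sum() / m
```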

The second part:

$\frac{\lambda}{2m} \sum_{h=1}^{L-1} \sum_{j=2}^{s_{h}+1} \sum_{i=1}^{s_{h+1}} (\theta_{j,i}^{(h)})^{2}$

is adding regularization. L represents the number of layers: in our example we have 4 layers and 3 theta matrices, therefore the summation goes from h = 1 to 3 (L − 1). Then we take the square of all the thetas and sum them, except those related to the bias terms (remember, the first row): j goes from 2 (skipping the first row) to the number of rows of the h-th theta matrix (Sh + 1), and i goes from 1 to the size of the (h+1)-th layer (Sh+1). Lambda is our regularization parameter.

Back propagation

The goal is to find the values of thetas which minimize the cost function shown above. In order to do this, we will use the back propagation algorithm. Let’s consider again our neural network:

We need to find the partial derivative of the cost function with respect to each theta. Here is how back propagation works:

Calculate the delta corresponding to the output layer, this is equal to:

$\delta^{(4)} = a^{(4)} - Y$    this is a 5×3 matrix in our example

$\delta^{(4)} = \begin{bmatrix} a_{1,1}^{(4)} & a_{1,2}^{(4)} & a_{1,3}^{(4)}\\ a_{2,1}^{(4)} & a_{2,2}^{(4)} & a_{2,3}^{(4)}\\ a_{3,1}^{(4)} & a_{3,2}^{(4)} & a_{3,3}^{(4)}\\ a_{4,1}^{(4)} & a_{4,2}^{(4)} & a_{4,3}^{(4)}\\ a_{5,1}^{(4)} & a_{5,2}^{(4)} & a_{5,3}^{(4)}\\ \end{bmatrix} - \begin{bmatrix} y_{1,1} & y_{1,2} & y_{1,3}\\ y_{2,1} & y_{2,2} & y_{2,3}\\ y_{3,1} & y_{3,2} & y_{3,3}\\ y_{4,1} & y_{4,2} & y_{4,3}\\ y_{5,1} & y_{5,2} & y_{5,3}\\ \end{bmatrix}$

Once we have this delta, we are able to calculate the derivative of the cost function with respect to all the thetas belonging to the third theta matrix, $\theta^{(3)}$, which is:

$\frac{\partial C}{\partial \theta^{(3)}} = \frac{1}{m}(a^{(3)})^{T} \cdot \delta^{(4)} + \frac{\lambda}{m}\theta^{(3)}$   (in the regularization term the first row of $\theta^{(3)}$ is zeroed out, since bias terms are not regularized)

Calculate the delta corresponding to the third layer:

$\delta^{(3)} = \delta^{(4)} \left [ \theta^{(3)} \right ]^T \odot a^{(3)} \odot \left ( 1-a^{(3)} \right )$   this is a 5×5 matrix in our example (the last two products are element-wise)

from which we remove the bias terms, which now correspond to the first column (because we've transposed the $\theta^{(3)}$ matrix), so it becomes a 5×4 matrix.

now we can calculate the derivative of the cost function with respect to the second thetas matrix:

$\frac{\partial C}{\partial \theta^{(2)}} = \frac{1}{m}(a^{(2)})^{T} \cdot \delta^{(3)} + \frac{\lambda}{m}\theta^{(2)}$

Calculate the delta corresponding to the second layer:

$\delta^{(2)} = \delta^{(3)} \left [ \theta^{(2)} \right ]^T \odot a^{(2)} \odot \left ( 1-a^{(2)} \right )$   this is a 5×6 matrix in our example

from which we remove the bias terms, which now correspond to the first column (because we've transposed the $\theta^{(2)}$ matrix), so it becomes a 5×5 matrix.

now we can calculate the derivative of the cost function with respect to the first thetas matrix:

$\frac{\partial C}{\partial \theta^{(1)}} = \frac{1}{m}(a^{(1)})^{T} \cdot \delta^{(2)} + \frac{\lambda}{m}\theta^{(1)}$

You can see the pattern.

Each delta, except the one corresponding to the output layer, is equal to:

$\delta^{(l)} = \delta^{(l+1)} \left [ \theta^{(l)} \right ]^T \odot a^{(l)} \odot \left ( 1-a^{(l)} \right )$ after which you remove the first column.

And the partial derivatives:

$\frac{\partial C}{\partial \theta^{(l)}} = \frac{1}{m}(a^{(l)})^{T} \cdot \delta^{(l+1)} + \frac{\lambda}{m}\theta^{(l)}$

What this algorithm is doing is propagating the error backwards from the output layer up to the first hidden layer; therefore our algorithm stops when we reach $\delta^{(2)}$.

What our code will need to do is: initialize the thetas, run forward propagation, calculate the cost and its gradients via back propagation, and feed them to an optimization routine.

Writing the Class

So, this is where the fun begins! Before starting to write our class, it is important to stop and think about how to implement it. The first parameters we need to pass to it are the sizes of the hidden layers, the value of the regularization parameter, and the maximum number of iterations it should perform while minimizing the cost function. Therefore we have the following __init__ method:
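The original code is not reproduced in this excerpt; a minimal sketch of what such an `__init__` could look like (the names `hidden_layer_sizes`, `reg_lambda` and `maxiter` are illustrative, not the post's actual identifiers):

```python
class NeuralNetwork:
    """Feed-forward neural network with logistic activations."""

    def __init__(self, hidden_layer_sizes=(25,), reg_lambda=0.0, maxiter=200):
        # e.g. hidden_layer_sizes=(5, 4) -> two hidden layers of 5 and 4 neurons
        self.hidden_layer_sizes = tuple(hidden_layer_sizes)
        self.reg_lambda = reg_lambda   # regularization parameter (lambda)
        self.maxiter = maxiter         # max iterations while minimizing the cost
        self.thetas = None             # learned parameters, set when fitting
```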

The second method we need to add calculates the logistic function:
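A standalone sketch of this helper (shown as a plain function here rather than a method):

```python
import numpy as np

def sigmoid(z):
    """Logistic function L(z) = 1 / (1 + e^(-z)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))
```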

The first thing our algorithm will need to do when the user trains the model is to create our theta matrices. The size of each theta matrix is given by the sizes of the layers it connects: a theta matrix $\theta^{(l)}$ has a number of rows equal to the size of layer l plus 1 ($s^{(l)} + 1$) and a number of columns equal to the size of layer l+1 ($s^{(l+1)}$).

Because the scipy.optimize algorithm needs a vector of parameters and a vector of partial derivatives, we need to create a flattened version of our theta matrices (all the thetas contained in a one-dimensional array), turning

$\theta^{(1)}= \begin{bmatrix} \theta_{1,1} & \theta_{1,2} & \theta_{1,3} & \theta_{1,4} & \theta_{1,5}\\ \theta_{2,1} & \theta_{2,2} & \theta_{2,3} & \theta_{2,4} & \theta_{2,5}\\ \theta_{3,1} & \theta_{3,2} & \theta_{3,3} & \theta_{3,4} & \theta_{3,5}\\ \theta_{4,1} & \theta_{4,2} & \theta_{4,3} & \theta_{4,4} & \theta_{4,5}\\ \end{bmatrix}$

$\theta^{(2)}= \begin{bmatrix} \theta_{1,1} & \theta_{1,2} & \theta_{1,3} \\ \theta_{2,1} & \theta_{2,2} & \theta_{2,3} \\ \theta_{3,1} & \theta_{3,2} & \theta_{3,3}\\ \end{bmatrix}$

into

$\theta_{flat} = \begin{bmatrix} \theta_{1,1}^{(1)} &\theta_{1,2}^{(1)} & \theta_{1,3}^{(1)} &.. & \theta_{4,5}^{(1)} &\theta_{1,1}^{(2)} & \theta_{1,2}^{(2)}&\theta_{1,3}^{(2)} &.. & \theta_{3,3}^{(2)} \end{bmatrix}$

The method below will create such a flattened vector of thetas and also store the size of each theta matrix, in order to recreate them when we vectorize the forward and back propagation steps:
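A sketch of this step, assuming small random initial values (the ±0.12 range and the name `init_flat_thetas` are illustrative choices, not necessarily the post's):

```python
import numpy as np

def init_flat_thetas(layer_sizes, epsilon=0.12, seed=None):
    """Random thetas for each pair of consecutive layers, flattened.

    layer_sizes excludes bias terms, e.g. (3, 5, 4, 3); each theta matrix
    has shape (s_l + 1, s_{l+1}), and the shapes are returned alongside
    the flat vector so the matrices can be rebuilt later.
    """
    rng = np.random.default_rng(seed)
    shapes = [(s + 1, t) for s, t in zip(layer_sizes[:-1], layer_sizes[1:])]
    thetas = [rng.uniform(-epsilon, epsilon, size=sh) for sh in shapes]
    return np.concatenate([t.ravel() for t in thetas]), shapes
```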

We also need a method, as mentioned, to reshape the flattened thetas into their appropriate sizes in order to take advantage of vectorized operations; the following method will do that for us:
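A possible sketch, using the stored shapes to slice the flat vector back into matrices:

```python
import numpy as np

def reshape_thetas(flat_thetas, shapes):
    """Rebuild the list of theta matrices from the flat vector."""
    thetas, start = [], 0
    for rows, cols in shapes:
        stop = start + rows * cols
        thetas.append(flat_thetas[start:stop].reshape(rows, cols))
        start = stop
    return thetas
```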

So far we have a way to create our flattened thetas, initialize them with random values and reshape them into appropriate matrices as needed. Now we can take care of writing the forward propagation method. It accepts two parameters: X (a matrix corresponding to the input layer, without the bias term, which will be added by the method) and a list of properly shaped theta matrices:
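A sketch of forward propagation following the equations above (bias columns are added to every layer except the output):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagation(X, thetas):
    """Return all layers a(1)..a(L) as a list of matrices."""
    m = X.shape[0]
    a = np.hstack([np.ones((m, 1)), X])      # input layer with bias term
    layers = [a]
    for i, theta in enumerate(thetas):
        a = sigmoid(a @ theta)               # z(l+1) = a(l) theta(l), then L(z)
        if i < len(thetas) - 1:              # the output layer gets no bias column
            a = np.hstack([np.ones((m, 1)), a])
        layers.append(a)
    return layers
```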

Now that we have a way to calculate each layer, we're ready to add the calculation of the cost function, which makes use of the above methods:
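A self-contained sketch (the helpers are repeated so the snippet runs on its own; the bias rows are excluded from the regularization penalty, as the cost formula's indices imply):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reshape_thetas(flat_thetas, shapes):
    thetas, start = [], 0
    for rows, cols in shapes:
        thetas.append(flat_thetas[start:start + rows * cols].reshape(rows, cols))
        start += rows * cols
    return thetas

def forward_propagation(X, thetas):
    m = X.shape[0]
    a = np.hstack([np.ones((m, 1)), X])
    layers = [a]
    for i, theta in enumerate(thetas):
        a = sigmoid(a @ theta)
        if i < len(thetas) - 1:
            a = np.hstack([np.ones((m, 1)), a])
        layers.append(a)
    return layers

def cost(flat_thetas, shapes, X, Y, reg_lambda):
    """Regularized cross-entropy cost over all classes of Y."""
    thetas = reshape_thetas(flat_thetas, shapes)
    P = forward_propagation(X, thetas)[-1]   # output layer a(L)
    m = Y.shape[0]
    J = -(Y * np.log(P) + (1 - Y) * np.log(1 - P)).sum() / m
    # regularization: square every theta except the bias rows (first row)
    J += reg_lambda / (2 * m) * sum((t[1:, :] ** 2).sum() for t in thetas)
    return J
```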

So we have our cost function and the forward propagation; we are just missing the gradient vector to feed into the scipy.optimize function. But to have that, we need to implement back propagation first. Our back propagation algorithm will first calculate the delta corresponding to the output layer and then, in a backward loop, calculate all the remaining deltas based on how many layers we have; each time a delta is calculated, it also computes the partial derivatives with respect to the current theta matrix:
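A self-contained sketch of the backward loop, returning the partial derivatives already flattened for the optimizer (again repeating the helpers so it runs on its own; the bias row is zeroed in the regularization term, consistent with the cost function):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reshape_thetas(flat_thetas, shapes):
    thetas, start = [], 0
    for rows, cols in shapes:
        thetas.append(flat_thetas[start:start + rows * cols].reshape(rows, cols))
        start += rows * cols
    return thetas

def forward_propagation(X, thetas):
    m = X.shape[0]
    a = np.hstack([np.ones((m, 1)), X])
    layers = [a]
    for i, theta in enumerate(thetas):
        a = sigmoid(a @ theta)
        if i < len(thetas) - 1:
            a = np.hstack([np.ones((m, 1)), a])
        layers.append(a)
    return layers

def back_propagation(flat_thetas, shapes, X, Y, reg_lambda):
    """Flat vector of partial derivatives, one entry per theta."""
    thetas = reshape_thetas(flat_thetas, shapes)
    layers = forward_propagation(X, thetas)
    m = Y.shape[0]
    delta = layers[-1] - Y                        # delta of the output layer
    grads = [None] * len(thetas)
    for i in range(len(thetas) - 1, -1, -1):
        reg = thetas[i].copy()
        reg[0, :] = 0.0                           # no penalty on the bias row
        grads[i] = layers[i].T @ delta / m + reg_lambda / m * reg
        if i > 0:                                 # stop once delta(2) is reached
            delta = (delta @ thetas[i].T) * layers[i] * (1 - layers[i])
            delta = delta[:, 1:]                  # drop the bias column
    return np.concatenate([g.ravel() for g in grads])
```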

Now we're able to create the gradient vector:

All the methods implemented above are 'private'. When using the class, we call the 'fit' method below to optimize the model:

The last missing method is a predict method:
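Prediction is simply a forward pass followed by taking, for each row, the output column with the highest probability; a sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def output_layer(X, thetas):
    """Run forward propagation and return only the output layer a(L)."""
    m = X.shape[0]
    a = np.hstack([np.ones((m, 1)), X])
    for i, theta in enumerate(thetas):
        a = sigmoid(a @ theta)
        if i < len(thetas) - 1:
            a = np.hstack([np.ones((m, 1)), a])
    return a

def predict(X, thetas):
    """Index of the most probable class for each observation."""
    return output_layer(X, thetas).argmax(axis=1)
```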

Testing the Class

It is now time to test the model on real data. In this project we will categorize digits from 0 to 9 by looking at images. The data set can be downloaded at the following links:

and is the one used in Andrew Ng's course: https://www.coursera.org/learn/machine-learning

The first file contains 5000 rows, where each row represents a 20 pixel by 20 pixel grayscale image of a digit. Each pixel is represented by a floating-point number indicating the grayscale intensity at that location, and the pixels are unrolled into a 400-dimensional vector (20 × 20).

The second file also contains 5000 rows and tells us how each image is classified. It contains 10 binary columns, representing the digits 1, 2, 3, 4, 5, 6, 7, 8, 9, 0. For example:

tells us that the first image is of a 0, the second of a 4 and the third of a 6.

Below is a Jupyter Notebook example of how to use the class and the accuracy obtained on this dataset. You can test different regularization parameters and numbers of iterations to see how this affects the predictions:
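The notebook itself is not reproduced in this excerpt. As an illustration of the final step, accuracy can be measured by comparing the predicted class index against the argmax of each one-hot row of Y (random stand-in arrays are used here in place of the actual digit data and model outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
Y = np.eye(10)[rng.integers(0, 10, size=100)]   # stand-in one-hot labels
probs = rng.random((100, 10))                   # stand-in output-layer activations

predicted = probs.argmax(axis=1)                # most probable digit per image
actual = Y.argmax(axis=1)                       # true digit per image
accuracy = (predicted == actual).mean()
```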