Neural Networks

Neural Network are computer systems inspired by the human brain, which can ‘learn things’ by looking at examples. They can be used in tasks like image recognition, where we want our model to classify images of animals for example. At the end of the post we will use the class that we have built to make our computer recognize a number digit by looking at pictures. The main focus of this post is on building a class in Python that can do just that.

Model Representation

Neural networks are usually represented as below:

The input layer corresponds to our training data, if for instance we’re trying to classify images, these would be a matrix **m** X **n**, where m is the number of images and n is the number of pixels of each image to which we would add a bias term (a column of 1’s) as we do in Logistic Regression.

The output layer is how our training data should be classified. For instance, if we’re classifying digits from 0 to 9, this would be a **m** x **10** matrix (10 digits), where each column represents the probability of each image belonging to that category.

The hidden layers can be thought of as neurons which get switched on and off based on the activation function. They capture more and more complexity with every layer we add. They are the magic of neural networks and provide the discrimination necessary to be able to separate your training data. You can increase the number of neurons in a particular hidden layer or you can increase the number of hidden layers, or both. Increasing the number of neurons will allow you to decrease your training error but it also reduces the amount of generalization, which can be very important depending on your problem. This balance is something your learn to manage the more times you do it.

There are different activation functions that can be used, in this post we will the logistic function, which will look familiar:

Forward Propagation

The activation of the ‘neurons’ in our hidden layers is done by an algorithm called forward propagation which takes us from our input layer to our output layer. This is how it works visually:

image you have a neural network consisting of an input layer of 3 features, one hidden layer of 5 neurons, another hidden layer of 4 neurons, and an output layer of 3 classes. The network would be:

Our input layer will be a m x 4 matrix, where m is the number of observations and 4 the number of features including the bias term (3+1)

The step from the input layer to the first hidden layer is done by multiplying the input layer by our first thetas matrix, which is the first set of parameters our model will need to ‘learn’ in order to minimize the cost function I will show later:

The multiplication gives us:

on which we will calculate the logistic function and add a bias term, therefore our first hidden layer is equal to:

The step from the first hidden layer to the second hidden layer is done by multiplying the input layer by our second thetas matrix:

The multiplication gives us:

on which we will calculate the logistic function and add a bias term, therefore our second hidden layer is equal to:

The step from the second hidden layer to the output layer is done by multiplying the second hidden layer by our third thetas matrix:

The multiplication gives us:

on which we will calculate the logistic function:

and so we have all our layers:

Cost function

As with Logistic Regression, we are facing an optimization problem. Keeping as an example the neural network shown above, we need to find the thetas which minimize our cost function. The thetas are contained within the three thetas matrices shown above:

It is important to notice that the first row of each theta matrix corresponds to the bias terms (they’re multiplied against the bias terms of their corresponding layer), this is an observation which will need to consider as we move along.

The cost function to minimize is:

Let’s look first at the part before the plus sign

This part is similar to that of the logistic regression, the only difference is that now we’re taking into account the errors on each class of Y, which is represented by the sum from k = 1 to p, where p is the number of classes in the output layer. If we are predicting just one class, like alive/dead, sick/not sick, then it would be exactly like the one of logistic regression. Imagine our training data is made up of just five rows, and we’ve a 3 classes classification problem, then our Y matrix may look like this:

If the first column was cat, the second column was a bird and the third column was a dog then this means that in our training data, the first image is of a cat, the second image is of a cat, the third image is of a bird, the fourth image is of a dog and the fifth image is of a bird.

Behind the scenes the first part of the cost function is doing (notice the element wise multiplication)

**+**

This results in a 5×3 matrix for which we need to sum all the elements and divide them by the number of samples (5) multiplied by -1, so **-1/m**.

The second part:

is adding regularization. L represents the number of layers, in our example we’ve 4 layers and 3 thetas matrices, therefore the summation goes from h = 1 to 3 (L-1). Then we’re taking the square of all the thetas and summing them, except for those related to the bias terms (remember 1st row) therefore j goes from 2 (skips first row) to the size of the **hth** layer (**S**h), and i goes from 1 to the size of the **ith+1** layer(**S**h+1). Lambda is our regularization parameter.

Back propagation

The goal is to find the values of thetas which minimize the cost function shown above. In order to do this, we will use the back propagation algorithm. Let’s consider again our neural network:

We need to find the partial derivative of the cost function for each theta. In order to do this we will use the back propagation algorithm. Here is how it works:

Calculate the delta corresponding to the output layer, this is equal to:

this is a **5×3** matrix in our example

Once we have the delta, we are able to calculate the derivative of the cost function with respect to all the thetas belonging to the third thetas matrix: which is:

Calculate the delta corresponding to the third layer:

this is a **5×5** matrix in our example

from which we remove the bias terms, which now correspond to the first column because we’ve transposed the thetas 3 matrix, so it becomes a **5×4** matrix

now we can calculate the derivative of the cost function with respect to the second thetas matrix:

Calculate the delta corresponding to the second layer:

this is a **5×6** matrix in our example

from which we remove the bias terms, which now correspond to the first column because we’ve transposed the thetas 2 matrix, so it becomes a **5×5** matrix

now we can calculate the derivative of the cost function with respect to the first thetas matrix:

You can see the patterns:

The deltas, except for the one corresponding to the output layer, is equal to:

and you remove the first column

And the partial derivatives:

What this algorithm is doing is propagating the error backwards from the output layer up until the first hidden layer, therefore our algorithm stops when we reach delta 2.

What our code will need to do is:

Writing the Class

So, this is where the fun begins! Before starting to write our class, it is important to stop and think about how to implement it. The first parameters we need to pass to it are the number and sizes of each hidden layer, the value of the regularization parameter and the number of maximum iterations it should perform while minimizing the cost function. Therefore we have, the following __init__ method:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 | def __init__(self, hidden_layers: tuple, regularization_parameter=0.0, max_iter=200): ''' PARAMETERS ---------- hidden_layers: a tuple containing the layers sizes\n the length of the tuple is the number of hidden layers and each element is the size of the ith hidden layer\n Example: (10,20,30): 3 hidden layers of sizes: 10 (1st hidden layer) 20 (2nd hidden layer) 30 (3rd hidden layer)\n regularization_parameter: the regularization parameter, if no regularization then = 0\n max_iter: Maximum number of function evaluation. if None, maxiter is set to max(100, 10*len(x0)). Defaults to None ''' self._hidden_layers = hidden_layers self._regularization_parameter = regularization_parameter self._max_iter = max_iter |

The second method we need to add is to calculate the logistic function:

1 2 3 4 5 6 7 8 9 10 11 | def _logistic_function(self, H: np.ndarray): ''' DESCRIPTION ----------- Calculate the logistic function over each element of H PARAMETERS --------- H: a numpy ndarray ''' return 1 / (1 + np.exp(-H)) |

The first thing our algorithm will need to do when the user trains the model is to create our thetas matrices. The size of each theta matrix is given by the sizes of each layer. A specific theta matrix l: has a number of rows equal to the size of l layer + 1: and has a number of column equal to the size of the l+1 layer:

Because the scipy.optimize algorithm needs a a vector of partial derivatives, we need to create a flattened version of our thetas matrices (all the thetas contained in a one dimensional array):

into

The below method will create such flattened vector of thetas and also store the size of each theta matrix in order to recreate it when we vectorize the forward and back propagation steps:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 | def _make_thetas(self, layers: tuple): ''' DESCRIPTION ----------- Create n-1 thetas matrixes, where n is the number of layers, len(layers)\n Each theta matrix has a size of:\n rows = (number of elements + 1) of the ith layer columns = the number of elements of the (ith + 1) layer\n Each theta matrix is initialized by np.random.randn PARAMETERS ---------- layers: a tuple where each element represents the size of the ith layer.\n it includes the input, output layers RETURNS --------- None ''' self.thetas = [] self._thetas_sizes = [] for i in range(len(layers))[:-1]: rows = layers[i] cols = layers[i+1] self.thetas.extend(np.random.randn(rows+1, cols).flatten(order='C')) self._thetas_sizes.append((rows+1, cols)) |

we also need a method, as said, to reshape the flattened thetas into their appropriate sizes in order to take advantage of vectorized operations, the following method will do that for us:

1 2 3 4 5 6 7 8 9 10 11 12 | def _reshape_thetas(self, flattened_thetas): # reshape thetas based on self._thetas_sizes reshaped_thetas = [] elements_count = 0 for theta_size in self._thetas_sizes: rows, cols = theta_size elements = rows * cols reshaped_thetas.append(np.reshape( flattened_thetas[elements_count:elements+elements_count], newshape=(rows, cols), order='C')) elements_count += elements return reshaped_thetas |

So far, we have a way to create our flattened thetas, initialize them with random values and reshape them into appropriate matrices as needed. So we can take care of writing the forward propagation method, this will accept two parameters: X (a matrix corresponding to the input layer without the bias term which will be added by our method) and a list of thetas matrices properly shaped:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 | def _forward_propagation(self, X: np.ndarray, thetas): ''' DESCRIPTION ----------- Performs forward propagation, calculating the values for each hidden layer and for the output layer. The input layer corresponds to X with the bias column added PARAMETERS ---------- X: numpy matrix (n_samples, n_features) thetas: a list of matrixes RETURNS --------- a list where each element is the calculation for each hidden layer plus the output layer. The output layer is the last element of the list ''' # store number of rows rows, _ = X.shape # layers is a list with the computations for each hidden layer and for the # output layer. the output layer is the last element of the list layers = [] # add bias column current_layer = np.hstack((np.ones((rows, 1)), X)) # go through the thetas matrixes and calculate each layer for index, theta_matrix in enumerate(thetas): # ith + 1 layer = matrix multiplication of the ith layer and the ith thetas current_layer = self._logistic_function( np.matmul(current_layer, theta_matrix)) # add bias to the calculated layer except for the output layer if (index + 1) != len(thetas): current_layer = np.hstack((np.ones((rows, 1)), current_layer)) layers.append(current_layer) return layers |

Now that we’ve a way to calculate each layer, we’re also ready to add the calculation of the cost function, which makes use of the above methods:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 | def _cost(self, thetas, X: np.ndarray, Y: np.ndarray): ''' DESCRIPTION ----------- Calculates the cost function PARAMETERS ---------- Thetas: a flattened list containing all the thetas. This will get properly reshaped into a list of matrix using self._thetas.sizes X: numpy array, the input layer without the bias term Y: numpy array ''' # reshape thetas based on self._thetas_sizes reshaped_thetas = self._reshape_thetas(thetas) # calculate each layer # the last element in the list is the output layer calculated_layers = self._forward_propagation(X, reshaped_thetas) output_layer = calculated_layers[-1] y_samples, y_features = Y.shape total_cost = 0.0 # calculate the cost total_cost = (Y * np.log(output_layer) + (1-Y) * np.log(1-output_layer)).sum() / -y_samples # if regularization added then regularize parameters # do not regularize parameters corresponding to bias terms # which corresponds to the first row of each theta matrix if self._regularization_parameter != 0.0: for theta_matrix in reshaped_thetas: total_cost += np.sum( np.power(theta_matrix[1:, :], 2)) / (2*y_samples) #print('Training the model. Cost value: {0}'.format(total_cost)) return total_cost |

So we have our cost function and the forward propagation, we are just missing the gradient vector to feed into scipy.optimize function. But in order to have that, we need to implement back propagation first. Our back propagation algorithm will first calculate the delta corresponding to the output layer and then, in a backward loop, calculate all the remaining deltas based on how many layers we have and each time a delta is calculate, it also calculates the partial derivatives with respect the current thetas matrix:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 | def _back_propagation(self, thetas, X, Y, calculated_layers): ''' DESCRIPTION ---------- Performs back propagation, calculating the deltas and the derivative for each layer PARAMETERS ---------\n thetas: a list of properly shaped matrixes representing the current thetas values\n X: numpy matrix (n_samples, n_features)\n Y: numpy matrix (n_samples, n_classes)\n calculated_layers: the layers calculated by the forward propagation\n RETURNS --------- a list where each element s a matrix of the derivatives for each layer ''' deltas = [] gradients = [] # store rows and cols rows, cols = Y.shape # add bias column to X X = np.hstack((np.ones((rows, 1)), X)) # the last delta is the output layer minus Y current_delta = calculated_layers[-1] - Y # add it to the deltas list deltas.append(current_delta) # loop backwards the calculated_layers going from the n-1 element to the 1st element # and backwards the thetas going from the n element to the 2nd element for layer, thetas_matrix in zip(calculated_layers[-2::-1], thetas[-1:0:-1]): # calculate the derivatives with respect to the thetas_matrix current_gradient = np.matmul(layer.T, current_delta) / rows if self._regularization_parameter != 0: current_gradient += thetas_matrix * (self._regularization_parameter / rows) gradients.append(current_gradient) # calculate the next delta current_delta = np.matmul(current_delta, thetas_matrix.T) * layer * (1-layer) # remove the first column which is linked to the biases terms current_delta = current_delta[:, 1:] # calculate the gradient for the first thetas matrix current_gradient = np.matmul(X.T, current_delta) / rows if self._regularization_parameter != 0: current_gradient += thetas[0] * (self._regularization_parameter / rows) gradients.append(current_gradient) # reverse the list containing the gradients, at the moment they're in reverse order gradients.reverse() return gradients |

now we’re able to create the gradient vector:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 | def _gradient(self, thetas, X, Y): ''' DESCRIPTION ----------- Returns a list of the gradients. Starts by reshaping the thetas into their matrix sizes then perform forward propagation. Once done, it performs back propagation in order to calculate the gradient vector.\n PARAMETERS ----------\n thetas: a flattened list of all the parameters\n X: numpy matrix (n_samples, n_features)\n Y: numpy matrix (n_samples, n_classes)\n ''' gradient_vector = [] # reshape thetas based on self._thetas_sizes reshaped_thetas = self._reshape_thetas(thetas) # calculate each layer # the last element in the list is the output layer calculated_layers = self._forward_propagation(X, reshaped_thetas) # perform back propagation, return a list of gradients pertaining to each layer gradients = self._back_propagation(reshaped_thetas, X, Y, calculated_layers) # flatten the gradients for gradient in gradients: gradient_vector.extend(gradient.flatten(order='C')) return gradient_vector |

all the methods implement above are all ‘private’. When using the class, we will use the below ‘fit’ method to optimize the model:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 | def fit(self, X: np.ndarray, Y: np.ndarray): ''' PARAMETERS ---------- X: m X n matrix where m is the number of observations and n is the number of features not including the bias\n Y: observed values. If multiclass classification then Y is a m X p matrix where p is the number of classes. Each column is a binary column with either a 0 (when the observation does not correspond to the ith class) or a 1 (when the observation corresponds to the ith class) ''' # store rows and cols sizes x_rows, x_cols = X.shape y_rows, y_cols = Y.shape # initialize the thetas, they will be store as a flattened list in self._thetas self._make_thetas((x_cols,) + self._hidden_layers + (y_cols,)) optimized = op.minimize(fun=self._cost, x0=self.thetas, args=( X, Y), method='TNC', jac=self._gradient, options={'maxiter': self._max_iter}) self.thetas = optimized.x |

The last method missing is a predict method:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 | def predict(self, X): ''' PARAMETERS ---------- X: numpy array (n_samples, n_features) RETURNS: a numpy array (n_samples, n_classes). For each row a 1 for the predicted class and zero for the rest ''' # reshape theas reshaped_thetas = self._reshape_thetas(self.thetas) layers = self._forward_propagation(X, reshaped_thetas) return (layers[-1] == np.max(layers[-1], axis=1, keepdims=True)) * 1 |

Testing the Class

It is now time to test the model on real data. In this project we will categorize number digits from 0 to 9 by looking at images. The data set can be downloaded at the following links:

images: https://github.com/LivioLanzo/image_classification_raw_data/raw/master/images.gz

labels: https://github.com/LivioLanzo/image_classification_raw_data/raw/master/labels.gz

and are those used by Andrew Ng course: https://www.coursera.org/learn/machine-learning

The first file contains 5000 rows where each row represent a 20 pixel by 20 pixel grayscale image of the digit. Each pixel is represented by a floating point number indicating the grayscale intensity at that location. The pixels are unrolled into a 400-dimensional vector (20 x 20).

The second file contains 5000 rows and tells us how each image was classified. It contains 10 binary columns, where they represent the digits: 1, 2, 3, 4, 5, 6, 7, 8, 9, 0. For example

tells us that first image of a 0, the second image of 4 and the third image of 6.

Below is a Jupyter Notebook example of how to use the class and the accuracy obtained from the dataset used. You can test different regularization parameters and number of iterations to see how this affect the predictions:

The code of the whole class can be downloaded at:

https://github.com/LivioLanzo/image_classification_raw_data/blob/master/NeuralNetwork.py