Livio / July 14, 2019 / Python

Logistic Regression

Logistic regression is a technique that can be applied in traditional statistics as well as in machine learning. It is used to predict whether something is true or false and can be used to model binary dependent variables, like win/loss, sick/not sick, pass/fail, etc.

Logistic regression assumes a linear relationship between the predictor variables and the log-odds of the event being True. This relationship is expressed by the following formula:

$\fn_cm \large \ln(\frac{P}{1-P}) = \Theta_{0} + \Theta_{1}X_{1} + ... + \Theta_{n}X_{n}$

where the left side represents the natural logarithm of the odds, and the right side should look very familiar to those already acquainted with linear regression. P represents the probability of the event being True and consequently 1-P is the probability of the event not being True. Because P is between 0 and 1, if we take the limits for P → 0 and for P → 1 we have:

$\fn_cm \large \lim_{P\rightarrow 0^{+}}\ln(\frac{P}{1-P}) = -\infty \qquad \lim_{P\rightarrow 1^{-}}\ln(\frac{P}{1-P}) = +\infty$
Because the final goal is to predict the probability of an event being True given the predictor variables, we need to find a formula for P. Starting from:

$\fn_cm \large \ln(\frac{P}{1-P}) = \Theta_{0} + \Theta_{1}X_{1} + ... + \Theta_{n}X_{n}$

we have:

$\fn_cm \large \frac{P}{1-P} = e^{\Theta_{0} + \Theta_{1}X_{1} + ... + \Theta_{n}X_{n}}$

which can also be written as:

$\fn_cm \large P = \frac{1}{1+e^{-(\Theta_{0} + \Theta_{1}X_{1} + ... + \Theta_{n}X_{n})}}$

This last equation is known as the logistic (sigmoid) function and, in the case of logistic regression, gives us the probability, given the predictors, of an event being True.

If, for the sake of simplicity, we let for a moment:

$\fn_cm \large t = \Theta_{0} + \Theta_{1}X_{1} + ... + \Theta_{n}X_{n}$

we have:

$\fn_cm \large p = \frac{1}{(1+e^{-t})}$  where t is the log of the odds

We can see that the above function is defined for any value of t. Moreover, when t = 0, p = 0.5:

$\fn_cm \large p =\frac{1}{1+e^{0}} = \frac{1}{1+1} = \frac{1}{2} = 0.5$

If we calculate the first derivative:

$\fn_cm \large \frac{\partial }{\partial t}(\frac{1}{1+e^{-t}}) = \frac{e^{-t}}{(1+e^{-t})^{2}} =\frac{1}{(\frac{1+e^{t}}{e^{t}})(\frac{1+e^{t}}{e^{t}})e^{t}} = \frac{1}{(\frac{1+e^{t}}{e^{t}})(1+e^{t})} = \frac{1}{\frac{(1+e^{t})^{2}}{e^{t}}} = \frac{e^{t}}{(1+e^{t})^{2}}$

we see that it is positive for any value of t, so the function is always increasing.

And the second derivative:

$\fn_cm \large \frac{\partial^2 }{\partial t^2}(\frac{1}{1+e^{-t}}) = \frac{e^{t}(1+e^{t})^{2} - 2e^{2t}(1+e^{t})}{(1+e^t)^{4}} = \frac{e^{t}(1+e^{t}) - 2e^{2t}}{(1+e^t)^{3}}$

we see that it is equal to zero when:

$\fn_cm \large e^{t}(1+e^{t}) - 2e^{2t} = 0$

which is when t = 0. For negative values of t, the second derivative is positive (the curve is concave up); for positive values of t, the second derivative is negative (the curve is concave down). Therefore we can sketch our logistic function (which represents the probability of an event being True):

When the log of the odds is negative we have a probability of less than 50% of the event being True, when the log of the odds is positive we have a probability of more than 50% of the event being True.
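These properties are easy to check numerically (a small illustrative snippet, not part of the original post):

```python
import numpy as np

# p = 1 / (1 + e^{-t}) for a range of log-odds values t
t = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
p = 1.0 / (1.0 + np.exp(-t))
# p is strictly increasing, equals 0.5 exactly at t = 0,
# stays below 0.5 for negative t and above 0.5 for positive t.
```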

The Cost Function

Our problem will be an optimization problem: finding the values of the thetas in

$\fn_cm \large \Theta_{0} + \Theta_{1}X_{1} + ... + \Theta_{n}X_{n}$

that minimize the cost function, which for logistic regression is the following:

$\large Cost = -y\ln(p) - (1-y)\ln(1-p)$

where y is the observed outcome (True/False, or better 1,0) and p is the probability of the event being True given our parameters. Remember that:

$\large p = \frac{1}{1+e^{-(\Theta_{0} + \Theta_{1}X_{1} + ... + \Theta_{n}X_{n})}}$

When the observed outcome is 1 (True), (1-y) * ln(1-p) evaluates to 0, therefore only the part -y * ln(p) is considered, and vice versa when the outcome is 0 (False).
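A small numeric sketch of this behaviour (the helper name is my own):

```python
import numpy as np

def single_cost(y, p):
    """Cost for one observation: -y ln(p) - (1 - y) ln(1 - p)."""
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

# y = 1: the (1 - y) term vanishes and the cost reduces to -ln(p),
#        so confident correct predictions (p near 1) are cheap.
# y = 0: the y term vanishes and the cost reduces to -ln(1 - p).
```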

If we quickly study the function -ln(p) we see that:

$\large \lim_{p\rightarrow 0^{+}}-\ln(p) = +\infty$

$\large \lim_{p\rightarrow 1^{-}}-\ln(p) = 0$

and:

$\large -\frac{\partial }{\partial p}\ln(p) = -\frac{1}{p}$

$\large -\frac{\partial^2 }{\partial p^2}\ln(p) = \frac{1}{p^{2}}$

and because 0 < p < 1, over this interval the first derivative is always negative and the second derivative is always positive. Therefore we can sketch our function:

If we think about it, when the observed value is y = 1 (True), which is when the cost function is equal to -y * ln(p) = -ln(p), we want the ‘cost’ to increase as our prediction moves towards 0 (False, wrong prediction) and to decrease as our prediction moves towards 1 (True, correct prediction).

We can do the same kind of analysis when the observed value is y = 0 (False). Our cost function becomes -(1-y) * ln(1-p) = -ln(1-p). If we study this function, we get:

$\large \lim_{p\rightarrow 0^{+}}-\ln(1-p) = 0$

$\large \lim_{p\rightarrow 1^{-}}-\ln(1-p) = +\infty$

$\large -\frac{\partial }{\partial p}\ln(1-p) = \frac{1}{1-p}$

$\large -\frac{\partial^2 }{\partial p^2}\ln(1-p) = \frac{1}{(1-p)^2}$

Therefore we can sketch this function:

If we think about it, when the observed value is y = 0 (False), which is when the cost function is equal to -(1-y) * ln(1-p) = -ln(1-p), we want the ‘cost’ to increase as our prediction moves towards 1 (True, wrong prediction) and to decrease as our prediction moves towards 0 (False, correct prediction).

So our goal is to find the thetas which minimize this cost function. To do this, we will use the Truncated Newton (TNC) method from SciPy. But first, we need to calculate the gradient vector, the vector of partial derivatives of the cost with respect to each of our thetas. I will show how to do this for Theta1; luckily, the process is the same for all the other thetas 🙂

I am showing below the whole procedure:
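The intermediate steps did not survive in this copy of the post; reconstructing the chain rule for a single observation, with $t = \Theta_{0} + \Theta_{1}X_{1} + ... + \Theta_{n}X_{n}$ and $p = \frac{1}{1+e^{-t}}$:

```latex
\frac{\partial\, Cost}{\partial \theta_{j}}
= \frac{\partial\, Cost}{\partial p}\cdot\frac{\partial p}{\partial t}\cdot\frac{\partial t}{\partial \theta_{j}}
= \left(-\frac{y}{p} + \frac{1-y}{1-p}\right)\, p(1-p)\, x_{j}
= \bigl(-y(1-p) + (1-y)p\bigr)\, x_{j}
= (p - y)\, x_{j}
```

Stacking this partial derivative for each theta gives the gradient vector: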

$\LARGE \begin{bmatrix} x_{0}(\frac{1}{1+e^{-(\Theta_{0} + \Theta_{1}X_{1} + ... + \Theta_{n}X_{n})}}-y) \\ x_{1}(\frac{1}{1+e^{-(\Theta_{0} + \Theta_{1}X_{1} + ... + \Theta_{n}X_{n})}}-y) \\ x_{2}(\frac{1}{1+e^{-(\Theta_{0} + \Theta_{1}X_{1} + ... + \Theta_{n}X_{n})}}-y) \\ .... \\ x_{n}(\frac{1}{1+e^{-(\Theta_{0} + \Theta_{1}X_{1} + ... + \Theta_{n}X_{n})}}-y) \end{bmatrix}$

Building a Logistic Regression Class in Python

In reality, because we will deal with many observations, the cost function will be the sum over all observations of the per-observation cost shown above, divided by the number of observations; therefore it becomes:

$\fn_cm Cost = \frac{1}{m} \sum_{i=1}^{m} -y^{(i)} \ln(\frac{1}{1+e^{-(\theta_{0} + \theta_{1}X_{1}^{(i)} + ...+\theta_{n}X_{n}^{(i)} )}}) -(1-y^{(i)}) \ln(1-\frac{1}{1+e^{-(\theta_{0} + \theta_{1}X_{1}^{(i)} + ...+\theta_{n}X_{n}^{(i)} )}})$

and the gradient vector is like the one shown above, but summed over the observations and divided by m:

$\fn_cm \large \begin{bmatrix} \frac{1}{m} \sum_{i=1}^{m} X_{0}^{(i)}(\frac{1}{1+e^{-(\theta_{0} + \theta_{1}X_{1}^{(i)} + ...+\theta_{n}X_{n}^{(i)} )}} - y^{(i)}) \\ \frac{1}{m} \sum_{i=1}^{m} X_{1}^{(i)}(\frac{1}{1+e^{-(\theta_{0} + \theta_{1}X_{1}^{(i)} + ...+\theta_{n}X_{n}^{(i)} )}} - y^{(i)}) \\ ..... \\ \frac{1}{m} \sum_{i=1}^{m} X_{n}^{(i)}(\frac{1}{1+e^{-(\theta_{0} + \theta_{1}X_{1}^{(i)} + ...+\theta_{n}X_{n}^{(i)} )}} - y^{(i)}) \end{bmatrix}$

Remember that m is the number of observations, whereas n is the number of attributes. The gradient vector contains n+1 elements because X_0 is a column of 1’s which is multiplied by theta_0, the intercept. This makes the matrix calculations easier.

The matrix of predictors is:

$\fn_cm \large X=\begin{bmatrix} X_{1,0} &X_{1,1} &X_{1,2} &... &X_{1,n} \\ X_{2,0} &X_{2,1} &X_{2,2} &... &X_{2,n} \\ .. &.. &.. &.. &.. \\ X_{m,0} &X_{m,1} &X_{m,2} &... &X_{m,n} \end{bmatrix} = (x_{i,j}) \in \mathbb{R}^{m \times (n+1)}$

Our thetas vector:

$\fn_cm \large \theta = \begin{bmatrix} \theta_{0} \\ \theta_{1} \\ ... \\ \theta_{n} \end{bmatrix} = (\theta_{i}) \in \mathbb{R}^{(n+1) \times 1}$

and our observed outcomes:

$\fn_cm \large Y = \begin{bmatrix} Y_{1} \\ Y_{2} \\ ... \\ Y_{m} \end{bmatrix} = (y_{i}) \in \mathbb{R}^{m \times 1}$

The first method we will add, called sigmoid, converts the log-odds vector into a probability vector:

$\fn_cm \large \begin{bmatrix} \theta_{0} X_{1,0} + \theta_{1}X_{1,1} + ...+\theta_{n}X_{1,n} \\ \theta_{0} X_{2,0} + \theta_{1}X_{2,1} + ...+\theta_{n}X_{2,n} \\ ... \\ \theta_{0} X_{m,0} + \theta_{1}X_{m,1} + ...+\theta_{n}X_{m,n} \end{bmatrix} \rightarrow \begin{bmatrix} \frac{1}{1+e^{-(\theta_{0} X_{1,0} + \theta_{1}X_{1,1} + ...+\theta_{n}X_{1,n})}} \\ \frac{1}{1+e^{-(\theta_{0} X_{2,0} + \theta_{1}X_{2,1} + ...+\theta_{n}X_{2,n})}} \\ ... \\ \frac{1}{1+e^{-(\theta_{0} X_{m,0} + \theta_{1}X_{m,1} + ...+\theta_{n}X_{m,n})}} \end{bmatrix}$

This is very easy in Python using Numpy:
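The original snippet is not included in this copy of the post; a minimal NumPy version (the function name is my assumption) could be:

```python
import numpy as np

def sigmoid(log_odds):
    """Convert a vector of log-odds into a vector of probabilities."""
    return 1.0 / (1.0 + np.exp(-np.asarray(log_odds, dtype=float)))
```

For example, odds of 3 to 1 correspond to a probability of 0.75, so `sigmoid(np.log(3))` returns 0.75 (up to floating point).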

The second method will be used to calculate the cost function. Remember that the cost function is:

$\fn_cm Cost = \frac{1}{m} \sum_{i=1}^{m} -y^{(i)} \ln(\frac{1}{1+e^{-(\theta_{0} + \theta_{1}X_{1}^{(i)} + ...+\theta_{n}X_{n}^{(i)} )}}) -(1-y^{(i)}) \ln(1-\frac{1}{1+e^{-(\theta_{0} + \theta_{1}X_{1}^{(i)} + ...+\theta_{n}X_{n}^{(i)} )}})$

therefore, we first need to calculate the log-odds, then the probabilities, and then the natural logarithms. In order to calculate the log-odds, which for the i-th observation are:

$\fn_cm \large \theta_{0}X_{i,0} + \theta_{1}X_{i,1} + ... + \theta_{n}X_{i,n}$

we can perform a matrix multiplication between the predictors and the thetas:

$\begin{bmatrix} X_{1,0} &X_{1,1} &X_{1,2} &... &X_{1,n} \\ X_{2,0} &X_{2,1} &X_{2,2} &... &X_{2,n} \\ .. &.. &.. &.. &.. \\ X_{m,0} &X_{m,1} &X_{m,2} &... &X_{m,n} \end{bmatrix} \cdot \begin{bmatrix} \theta_{0}\\ \theta_{1}\\ ...\\ \theta_{n} \end{bmatrix} = \begin{bmatrix} X_{1,0}\cdot \theta_{0} + X_{1,1}\cdot \theta_{1} + ... + X_{1,n}\cdot \theta_{n} \\ X_{2,0}\cdot \theta_{0} + X_{2,1}\cdot \theta_{1} + ... + X_{2,n}\cdot \theta_{n} \\ ...\\ X_{m,0}\cdot \theta_{0} + X_{m,1}\cdot \theta_{1} + ... + X_{m,n}\cdot \theta_{n} \end{bmatrix}$

then convert the log-odds to probabilities and calculate the cost function:

$\fn_cm \large \begin{bmatrix} -Y_{1,1}\\ -Y_{2,1}\\ ...\\ -Y_{m,1} \end{bmatrix}^{T} \cdot \begin{bmatrix} \ln(\frac{1}{1+e^{-\sum_{j=0}^{n}X_{1,j}\theta_{j}}}) \\ \ln(\frac{1}{1+e^{-\sum_{j=0}^{n}X_{2,j}\theta_{j}}}) \\ ...\\ \ln(\frac{1}{1+e^{-\sum_{j=0}^{n}X_{m,j}\theta_{j}}}) \end{bmatrix} - \begin{bmatrix} 1-Y_{1,1}\\ 1-Y_{2,1}\\ ...\\ 1-Y_{m,1} \end{bmatrix}^{T} \cdot \begin{bmatrix} \ln(1-\frac{1}{1+e^{-\sum_{j=0}^{n}X_{1,j}\theta_{j}}}) \\ \ln(1-\frac{1}{1+e^{-\sum_{j=0}^{n}X_{2,j}\theta_{j}}}) \\ ...\\ \ln(1-\frac{1}{1+e^{-\sum_{j=0}^{n}X_{m,j}\theta_{j}}}) \end{bmatrix}$

the above will return a scalar value, which will be divided by the number of observations. Python code:
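The code block is missing from this copy of the post; a NumPy sketch of this cost computation (names are my own; a clip guard is added against log(0)) could be:

```python
import numpy as np

def cost(thetas, X, y):
    """Mean cross-entropy cost; X already contains the column of 1's."""
    m = len(y)
    p = 1.0 / (1.0 + np.exp(-(X @ thetas)))   # log-odds -> probabilities
    p = np.clip(p, 1e-10, 1 - 1e-10)          # guard against log(0)
    return (-y @ np.log(p) - (1 - y) @ np.log(1 - p)) / m
```

With all thetas equal to zero, every probability is 0.5 and the cost is ln(2) regardless of the data.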

The same thing now needs to be done for the gradient vector, which will be the vector of the values of the partial derivatives.

$\begin{bmatrix} \frac{\partial Cost}{\partial \theta_{0}} = \frac{1}{m} \sum_{i=1}^{m} X_{i,0}(\frac{1}{1+e^{-(\theta_{0}X_{i,0} + \theta_{1}X_{i,1} + ... + \theta_{n}X_{i,n})}} - Y_{i,1}) \\ \frac{\partial Cost}{\partial \theta_{1}} = \frac{1}{m} \sum_{i=1}^{m} X_{i,1}(\frac{1}{1+e^{-(\theta_{0}X_{i,0} + \theta_{1}X_{i,1} + ... + \theta_{n}X_{i,n})}} - Y_{i,1}) \\ ... \\ \frac{\partial Cost}{\partial \theta_{n}} = \frac{1}{m} \sum_{i=1}^{m} X_{i,n}(\frac{1}{1+e^{-(\theta_{0}X_{i,0} + \theta_{1}X_{i,1} + ... + \theta_{n}X_{i,n})}} - Y_{i,1}) \end{bmatrix}$

Letting the probability for the i-th row be:

$P_{i} = \frac{1}{1+e^{-(\theta_{0}X_{i,0} + \theta_{1}X_{i,1} + ... + \theta_{n}X_{i,n})}}$

we can rewrite the gradient computation as:

$\large \frac{1}{m}\begin{bmatrix} X_{1,0} &X_{1,1} &... &... &X_{1,n} \\ X_{2,0} &X_{2,1} &... &... &X_{2,n} \\ ... &... &... &... &... \\ X_{m,0} &X_{m,1} &... &... &X_{m,n} \end{bmatrix}^{T} \cdot \begin{bmatrix} P_{1} - Y_{1,1}\\ P_{2} - Y_{2,1}\\ ... \\ P_{m} - Y_{m,1} \end{bmatrix}$

or, writing the transpose explicitly:

$\large \frac{1}{m}\begin{bmatrix} X_{1,0} &X_{2,0} &... &... &X_{m,0} \\ X_{1,1} &X_{2,1} &... &... &X_{m,1} \\ ... &... &... &... &... \\ X_{1,n} &X_{2,n} &... &... &X_{m,n} \end{bmatrix} \cdot \begin{bmatrix} P_{1} - Y_{1,1}\\ P_{2} - Y_{2,1}\\ ... \\ P_{m} - Y_{m,1} \end{bmatrix}$

which becomes:

$\large \frac{1}{m}\begin{bmatrix} X_{1,0} &X_{2,0} &... &... &X_{m,0} \\ X_{1,1} &X_{2,1} &... &... &X_{m,1} \\ ... &... &... &... &... \\ X_{1,n} &X_{2,n} &... &... &X_{m,n} \end{bmatrix} \cdot \begin{bmatrix} P_{1} - Y_{1,1}\\ P_{2} - Y_{2,1}\\ ... \\ P_{m} - Y_{m,1} \end{bmatrix} = \begin{bmatrix} \frac{X_{1,0}\cdot (P_{1} - Y_{1,1}) + X_{2,0}\cdot (P_{2} - Y_{2,1}) + ... + X_{m,0}\cdot (P_{m} - Y_{m,1})}{m}\\ \frac{X_{1,1}\cdot (P_{1} - Y_{1,1}) + X_{2,1}\cdot (P_{2} - Y_{2,1}) + ... + X_{m,1}\cdot (P_{m} - Y_{m,1})}{m}\\ ... \\ \frac{X_{1,n}\cdot (P_{1} - Y_{1,1}) + X_{2,n}\cdot (P_{2} - Y_{2,1}) + ... + X_{m,n}\cdot (P_{m} - Y_{m,1})}{m}\\ \end{bmatrix}$
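The corresponding code block is missing from this copy; the matrix product above can be sketched in NumPy as follows (the function name is my own assumption):

```python
import numpy as np

def gradient(thetas, X, y):
    """Gradient of the cost: X^T (p - y) / m, with X containing the 1's column."""
    m = len(y)
    p = 1.0 / (1.0 + np.exp(-(X @ thetas)))  # vector of probabilities P_i
    return X.T @ (p - y) / m
```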

How to calculate R^2 for Logistic Regression

There are several ways of calculating r-squared for logistic regression. Here I will use the so-called McFadden pseudo R-squared. The idea is very similar to the r-squared of linear regression. In linear regression, r-squared is defined as:

$\large R^{2} = 1-\frac{\sum_{i=1}^{m}(\theta_{0}X_{i,0} + \theta_{1}X_{i,1} + ... + \theta_{n}X_{i,n} - Y_{i})^{2}} {\sum_{i=1}^{m} (Y_{i} - \bar{Y})^{2}}$

and it is the percentage of variation around the mean that goes away when you fit a linear regression model. It goes from 0 to 1. When fitting a line does not improve on the variation around the mean, we get 0 and we have a very bad model. As the sum of squared errors goes down, R^2 goes up towards 1.

For logistic regression, we can first calculate the probability of an event being True without taking into account any of the predictors: we only look at the observed outcomes and divide the number of Trues by the total number of observations. We will call this probability o, and we will call the probabilities calculated by our model p.

Therefore, o is simply:

$\fn_cm \large o = \frac{\text{Number of Trues}}{m}$

and p for the i-th observation is:

$\fn_cm \large p_{i} = \frac{1}{1+e^{-(\theta_{0}X_{i,0} + \theta_{1}X_{i,1} + ... + \theta_{n}X_{i,n})}}$

and R-squared is:

$\fn_cm \large R^{2} = 1- \frac{\sum_{i=1}^{m}y_{i}\ln(p_{i}) + (1-y_{i})\ln(1-p_{i})} {\sum_{i=1}^{m}y_{i}\ln(o) + (1-y_{i})\ln(1-o)}$

You will notice that the numerator is very similar to the cost function, which is also the case in linear regression.
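A sketch of this computation (the function name is my own assumption; y is the vector of observed outcomes and p the vector of model probabilities):

```python
import numpy as np

def mcfadden_r2(y, p):
    """McFadden pseudo R-squared from observed outcomes y and model probabilities p."""
    o = y.mean()  # null-model probability: share of Trues
    ll_model = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    ll_null = np.sum(y * np.log(o) + (1 - y) * np.log(1 - o))
    return 1 - ll_model / ll_null
```

If the model probabilities are no better than the overall share of Trues, the ratio is 1 and R-squared is 0; the better the probabilities track the outcomes, the closer it gets to 1.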

Putting it all together, our class becomes:
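The class itself did not survive in this copy of the post; below is a sketch consistent with the derivation above (method names and structure are my own assumptions, with clip guards added against overflow and log(0)):

```python
import numpy as np
from scipy.optimize import minimize


class LogisticRegression:
    """Logistic regression fitted with SciPy's Truncated Newton (TNC) method."""

    def sigmoid(self, log_odds):
        # Map log-odds to probabilities; clip to avoid overflow in exp().
        z = np.clip(log_odds, -500, 500)
        return 1.0 / (1.0 + np.exp(-z))

    def cost(self, thetas, X, y):
        # Mean cross-entropy cost over the m observations.
        p = np.clip(self.sigmoid(X @ thetas), 1e-10, 1 - 1e-10)
        return (-y @ np.log(p) - (1 - y) @ np.log(1 - p)) / len(y)

    def gradient(self, thetas, X, y):
        # Gradient vector: X^T (p - y) / m.
        return X.T @ (self.sigmoid(X @ thetas) - y) / len(y)

    def add_intercept(self, X):
        # Prepend the X_0 column of 1's multiplied by theta_0, the intercept.
        return np.column_stack([np.ones(len(X)), X])

    def fit(self, X, y):
        Xb = self.add_intercept(X)
        result = minimize(self.cost, np.zeros(Xb.shape[1]), args=(Xb, y),
                          jac=self.gradient, method='TNC')
        self.thetas = result.x
        return self

    def predict_proba(self, X):
        return self.sigmoid(self.add_intercept(X) @ self.thetas)

    def predict(self, X, threshold=0.5):
        return (self.predict_proba(X) >= threshold).astype(int)

    def r_squared(self, X, y):
        # McFadden pseudo R-squared.
        p = np.clip(self.predict_proba(X), 1e-10, 1 - 1e-10)
        o = y.mean()
        ll_model = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
        ll_null = np.sum(y * np.log(o) + (1 - y) * np.log(1 - o))
        return 1 - ll_model / ll_null
```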

Now that we have our class, let’s use it to make some predictions. Below is a simple project which uses the class we’ve just built to make breast cancer predictions:
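The project code is also missing from this copy; a self-contained sketch of the same idea (using scikit-learn only to load the breast-cancer dataset, which is an assumption about the original data source, and inlining the cost and gradient rather than relying on the class above) could be:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # numerically stable sigmoid
from sklearn.datasets import load_breast_cancer

# Load the data and standardize the predictors.
data = load_breast_cancer()
y = data.target.astype(float)
X_std = (data.data - data.data.mean(axis=0)) / data.data.std(axis=0)
X = np.column_stack([np.ones(len(X_std)), X_std])  # add the intercept column

def cost(thetas, X, y):
    p = np.clip(expit(X @ thetas), 1e-10, 1 - 1e-10)
    return (-y @ np.log(p) - (1 - y) @ np.log(1 - p)) / len(y)

def gradient(thetas, X, y):
    return X.T @ (expit(X @ thetas) - y) / len(y)

# Fit with the Truncated Newton method and measure training accuracy.
result = minimize(cost, np.zeros(X.shape[1]), args=(X, y),
                  jac=gradient, method='TNC')
predictions = (expit(X @ result.x) >= 0.5).astype(float)
accuracy = (predictions == y).mean()
```

Standardizing the predictors is a design choice that helps the optimizer converge; the fitted thetas then refer to the standardized scale.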
