Gradient Descent and Logistic Regression

8 minute read

Logistic Regression

In this post we will look at one of the most popular and useful algorithms in Machine Learning, and a building block of much of Deep Learning: the Logistic Regression algorithm. At a high level, it goes like this:

  • Take your data
  • Pick a random model
  • Calculate the error
  • Minimize the error, and obtain a better model

Before we calculate the error, let’s see how to define an error function for Logistic Regression.

Calculate the Error Function for Logistic Regression

Let’s say we have two models, the bad model on the left and the good model on the right. For each of them we calculate the Cross Entropy, which is the sum of the negatives of the logarithms of the probabilities of the points being their colors. We conclude that the one on the right is better because it has a lower cross entropy.
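To make this concrete, here is the cross entropy written out, followed by a tiny worked example; the probability values are hypothetical, purely for illustration:

```latex
% Cross entropy: sum of the negative log-probabilities of each point being its color
\text{CE} = -\sum_{i=1}^{m} \ln(p_i)

% Hypothetical "good" model with p = 0.7, 0.8, 0.9:
%   CE = -(\ln 0.7 + \ln 0.8 + \ln 0.9) \approx 0.69
% Hypothetical "bad" model with p = 0.2, 0.4, 0.5:
%   CE = -(\ln 0.2 + \ln 0.4 + \ln 0.5) \approx 3.22
```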

So now let’s derive the formula for the error function. There are two cases:

  • When y = 1, the point is blue to begin with. The model tells us that the probability of it being blue is the prediction y-hat. For example, P(blue) = 0.6 for a point in the blue area and P(blue) = 0.2 for a point in the red area: the point in the blue area has a higher probability of being blue. The error is simply the negative of the logarithm of this probability, -ln(y-hat).

  • When y = 0, the point is red, so we need the probability of the point being red. That probability is 1 - P(blue), which is 1 - y-hat. The corresponding error is simply the negative of the logarithm of this probability, -ln(1 - y-hat).

Thus, the error for a single point can be summarized in one equation: Error = -y ln(y-hat) - (1 - y) ln(1 - y-hat). When y = 1 only the first term survives, and when y = 0 only the second.

Now, the full Error Function is simply the combination of the per-point errors over all the points. Note that, by convention, we take the average rather than the sum, which is why we divide by m.
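Written out for m points, with labels y_i and predictions y-hat_i, the error function is:

```latex
E = -\frac{1}{m}\sum_{i=1}^{m}\Big[\, y_i \ln(\hat{y}_i) + (1 - y_i)\,\ln(1 - \hat{y}_i) \,\Big]
```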

From now on, we will use this formula as our Error Function.

Express the Error Function in terms of Weights of the model

From the earlier posts, we learnt that y-hat is given by the sigmoid of the linear function Wx + b; hence the formula for the error is actually a function of W and b, the weights and bias of the model.

The error function is expressed in terms of the weights of the model, because ultimately we wish to find the weights of the line that best separates our data.
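Substituting the sigmoid prediction into the error function makes the dependence on the weights explicit. A sketch of this standard form, where x^(i) denotes the i-th point:

```latex
\hat{y} = \sigma(Wx + b), \qquad \sigma(t) = \frac{1}{1 + e^{-t}}

E(W, b) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\, y_i \ln\!\big(\sigma(Wx^{(i)} + b)\big)
          + (1 - y_i)\,\ln\!\big(1 - \sigma(Wx^{(i)} + b)\big) \,\Big]
```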

Now, that we have an Error Function expressed in terms of weights, our goal is to minimize this Error Function, which we will look into next.

How to minimize the Error Function?

Our goal is now to minimize the above Error Function. We start with some random weights, which give us the prediction sigmoid(Wx + b) and a corresponding boundary line.

Recall that if a point is misclassified, it contributes a larger error. Our goal is to minimize this error function using gradient descent: picture the error surface as a mountain, and we jiggle the line around until the error is minimized.

The error function is the height E(W, b), where W and b are the weights and bias of the line. We use gradient descent to get to the bottom of the mountain, thus reducing the height. This gives us a smaller error, with new weights W-prime and bias b-prime.

Understanding Gradient Descent

What is a Gradient?

Our Error Function is a function of our weights, so it can be graphed as a surface; it has a precise mathematical structure rather than being a literal mountain.

So, imagine that we are standing somewhere on this surface and we need to go down. The inputs to the function are W1 and W2, and the error function is E. The gradient of E is the vector of the partial derivatives of E with respect to W1 and W2. This gradient tells us the direction in which to move to increase the Error Function the most.

Thus, if we take the negative of the gradient, this will tell us how to decrease the error function the most. So, at the point that we are standing, we will take the negative of the gradient of the error function at that point. Then, we take a step in that direction. Once we take the step, we will be in a lower position, so we do it again and again until we are able to get to the bottom of the mountain.
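In symbols, for a two-weight error surface the gradient and the direction of steepest descent are:

```latex
% Gradient: vector of partial derivatives (direction of steepest increase)
\nabla E = \left( \frac{\partial E}{\partial W_1}, \frac{\partial E}{\partial W_2} \right)

% We step in the opposite direction to decrease the error the most
\Delta W \propto -\nabla E
```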

The Gradient Descent Step

  1. We start with our initial prediction y-hat. Let’s say that this prediction is bad, because the error is too high. This y-hat can be expressed in terms of the weights as sigmoid(Wx + b).
  2. The error function is given by the formula we saw before. What matters here is the gradient of the error function, the vector of its partial derivatives with respect to the weights and bias.
  3. Now we take a step in the negative direction of the gradient. However, we don’t want to make any drastic changes, so we introduce something called a learning rate, and multiply the gradient by that small number. But what exactly do we mean by taking a step?
  4. Taking a step is exactly the same thing as updating the weights and bias. The weight W(i) becomes W(i)-prime: we update each weight by subtracting the quantity learning_rate * gradient, and likewise for the bias (see the equation after this list).
  5. This takes us to a prediction with a lower error. So the prediction we have now, with weights W-prime and bias b-prime, is better than the one we had before.

This is what we mean by a gradient descent step.
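As an equation, with learning rate α, one gradient descent step updates every weight and the bias like so:

```latex
w_i' \leftarrow w_i - \alpha \frac{\partial E}{\partial w_i}, \qquad
b' \leftarrow b - \alpha \frac{\partial E}{\partial b}
```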

But how to get the partial derivative of E w.r.t W and b?

It turns out that the partial derivatives of E can be mathematically shown to simplify to the following:
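For the sigmoid prediction and cross-entropy error above, the per-point partial derivatives reduce to the standard result (the derivation uses σ′(t) = σ(t)(1 − σ(t))):

```latex
\frac{\partial E}{\partial w_i} = -(y - \hat{y})\, x_i, \qquad
\frac{\partial E}{\partial b} = -(y - \hat{y})
```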

Revised Gradient Descent Step

Substituting these derivatives, our revised Gradient Descent Step becomes:
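For each point (x1, …, xn) with label y and prediction y-hat:

```latex
w_i' \leftarrow w_i + \alpha\,(y - \hat{y})\, x_i, \qquad
b' \leftarrow b + \alpha\,(y - \hat{y})
```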

Putting it all together: Gradient Descent Algorithm

Now, we have all the tools to define the pseudocode for the gradient descent algorithm.

Step 1: Start with random weights and bias, which give us a line. In fact, not just a line, but the whole probability function given by the sigmoid of Wx + b.

Step 2: Now for every point, we calculate the error; as we have seen, the error is large for misclassified points and small for correctly classified points.

Step 2a: For every point with coordinates (x1, .., xn), we update each weight W(i) by subtracting learning_rate times the partial derivative of the error function w.r.t. W(i); similarly, we update the bias b.

This gives rise to new weights and biases.

Step 2b: We already calculated these partial derivatives, so the update is exactly the revised Gradient Descent Step given above.

Step 3: We repeat this process until the error is small, or for a fixed number of iterations, called epochs.
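Here is a minimal NumPy sketch of the whole algorithm, following the pseudocode above. The function and variable names are illustrative, not from any particular library:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def train_logistic_regression(X, y, learning_rate=0.01, epochs=100):
    """Gradient descent for logistic regression.

    X: (m, n) array of input points, y: (m,) array of 0/1 labels.
    Returns the learned weights W (n,) and bias b.
    """
    m, n = X.shape
    rng = np.random.default_rng(42)
    W = rng.normal(scale=0.1, size=n)   # Step 1: random weights
    b = 0.0                             # ... and bias

    for epoch in range(epochs):         # Step 3: repeat for a fixed number of epochs
        y_hat = sigmoid(X @ W + b)      # predictions for every point

        # Step 2: cross-entropy error, averaged over the m points
        # (not needed for the update itself, but it tracks progress)
        error = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

        # Steps 2a/2b: gradient descent step, using dE/dW(i) = -(y - y_hat) * x(i)
        W += learning_rate * (y - y_hat) @ X / m
        b += learning_rate * np.mean(y - y_hat)

    return W, b
```

Calling train_logistic_regression(X, y) on a small 2-D dataset returns weights W and a bias b that define the separating line Wx + b = 0.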

Appendix

Role of derivatives

Machine learning uses derivatives in optimization problems. Optimization algorithms like gradient descent use derivatives to decide whether to increase or decrease the weights in order to increase or decrease the objective function.

If we are able to compute the derivative of a function, we know in which direction to proceed to minimize it.

Two main goals of gradient descent

Gradient descent is one of the most popular algorithms to perform optimization and by far the most common way to optimize neural networks. It is an iterative optimization algorithm used to find the minimum value for a function.

With Gradient Descent, we are after two things:

  1. Which direction we need to go, given by the gradient.
  2. How big a step to take, given by the learning rate.

Difference between Weights and Inputs

A key idea when applying gradient descent to our neural network is that we consider input values (x) to be constants, with our goal being to adjust the weights (w), including the bias input weight (w0). This might seem odd given our description of gradient descent, where we try to find input values that minimize a function. At first sight, for the two-input perceptron, it seems like x1 and x2 would be considered input values. That would be true if we had a perceptron with fixed weights and a desired output value and the task at hand was to find the x-values that result in this output value given the fixed weights. However, this is not what we are trying to do with our learning algorithm. The purpose of our learning algorithm is to, given a fixed input (x1, x2), adjust the weights (w0, w1, w2) so that the output value takes on the value we want to see. That is, we treat x1 and x2 as constants (x0 as well, but that is always the constant 1, as stated earlier), while we treat w0, w1, and w2 as variables that we can adjust.

During learning, not the inputs (x) but the weights (w) are considered to be the variables in our function.

To make it more concrete, if we are training a network to distinguish between a dog and a cat, the pixel values would be the inputs (x) to the network. If it turned out that the network incorrectly classified a picture of a dog as being a cat, we would not go ahead and adjust the picture to look more like a cat. Instead, we would adjust the weights of the network to try to make it correctly classify the dog as being a dog.
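A small sketch of this idea (the input values and starting weights below are made up): the gradient step adjusts w and b while x stays fixed.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# A fixed training example: the inputs (the "picture") and its label never change.
x = np.array([0.5, -1.2])   # constant inputs
y = 1.0                     # desired output

w = np.array([0.1, 0.3])    # adjustable weights
b = 0.0                     # adjustable bias
learning_rate = 0.1

y_hat = sigmoid(w @ x + b)

# We adjust w and b (never x) to move y_hat toward y.
w += learning_rate * (y - y_hat) * x
b += learning_rate * (y - y_hat)
```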

Difference between Stochastic Gradient Descent and Batch Gradient Descent

The cost function is a function of the parameters ‘m’ and ‘b’, and we need to find its derivative w.r.t. both of them.

The distinction between stochastic and true gradient descent is that, with true gradient descent, we would compute the gradient as the mean value of the gradients for all individual training examples, whereas with SGD, we approximate the gradient by computing it for only a single training example. There are also hybrid approaches in which you approximate the gradient by computing a mean of some, but not all, training examples.

Gradient descent requires you to compute the gradient for all input examples before updating the weights, but stochastic gradient descent only requires you to compute the gradient for a single input example.
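The difference is easiest to see in a sketch. Both functions below use the per-point gradient from the logistic regression derivation above; the names are illustrative:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def batch_gradient_descent_epoch(X, y, W, b, lr):
    """One epoch of true (batch) gradient descent: compute the mean
    gradient over ALL examples, then update the weights once."""
    y_hat = sigmoid(X @ W + b)
    W = W + lr * (y - y_hat) @ X / len(y)   # mean gradient over all m points
    b = b + lr * np.mean(y - y_hat)
    return W, b

def stochastic_gradient_descent_epoch(X, y, W, b, lr):
    """One epoch of SGD: update after every single example, using that
    example's gradient as an approximation of the true gradient."""
    for xi, yi in zip(X, y):
        y_hat = sigmoid(xi @ W + b)
        W = W + lr * (yi - y_hat) * xi
        b = b + lr * (yi - y_hat)
    return W, b
```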