To understand an error function, let’s imagine that we are standing on top of a mountain. Our goal is to descend the mountain as efficiently as possible. We start by looking around in all the directions in which we could walk. Then we pick the direction that makes us descend the most and take a step in that direction. In this manner we decrease our height.
Once we have taken the step, we repeat the process again and again, always decreasing the height, until we have gone all the way down the mountain and the height is at its minimum.
In this case, the key metric we use to solve the problem is the height. The height here is our Error. Error is what tells us how badly we are doing at the moment and how far we are from the ideal solution. This method, which I will discuss in more detail later, is called Gradient Descent.
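Here is a minimal sketch of that descent in one dimension. The mountain profile `height`, the step size, and the starting point are all assumptions made up for this illustration; they are not part of the classification problem we are building up to.

```python
def height(x):
    """Hypothetical mountain profile whose lowest point is at x = 3."""
    return (x - 3.0) ** 2


def slope(x):
    """Derivative of height(x); its sign tells us which way is uphill."""
    return 2.0 * (x - 3.0)


def descend(x, step_size=0.1, n_steps=100):
    """Repeatedly take a small step against the slope, i.e. downhill."""
    for _ in range(n_steps):
        x = x - step_size * slope(x)
    return x


start = 10.0
end = descend(start)
print(f"start height: {height(start):.4f}, final height: {height(end):.6f}")
```

Each step lowers the height a little, and after enough steps we end up very close to the bottom.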
The need for a continuous error function
Consider our problem, where we are trying to split the data. What would be a good way to tell the computer how badly it’s doing? Let’s first look at a naive approach where we just count the number of mistakes. So, in our example below, there are 2 mistakes, i.e., 2 errors, and that’s our height.
Just as we did to descend from the mountain, we look around in all the directions in which we can move the line in order to decrease the error. Let’s say we first move the line and decrease the error to 1, and then we move the line again to decrease the error to 0. But things aren’t that simple. In our algorithms, we will be taking very small steps, and the reason for that is calculus: our tiny steps will be calculated by derivatives. So you must be wondering, why can’t we take very small steps here? The trouble is that the number of mistakes is a whole number. If we nudge the line only slightly, the count usually stays exactly the same, so nothing tells us which direction is downhill. This is equivalent to doing gradient descent from an Aztec pyramid with flat steps: standing on a step, every direction looks level.
In order to do gradient descent our error function must be continuous.
So, our goal is to construct an error function that is continuous so that we can use gradient descent to minimize the error.
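To see the flat steps concretely, here is a small sketch that counts mistakes for a line `w1*x1 + w2*x2 + b = 0`. The six points, their labels, and the line are hypothetical values I picked so that the line makes two mistakes; they are not the data from the figure.

```python
# Hypothetical 2-D points: label 1 = blue, label 0 = red.
points = [(1.0, 1.0), (2.0, 0.5), (3.5, 1.0), (3.0, 3.0), (4.0, 2.5), (1.5, 4.0)]
labels = [0, 0, 0, 1, 1, 1]


def count_mistakes(w1, w2, b):
    """Number of points that fall on the wrong side of the line."""
    mistakes = 0
    for (x1, x2), label in zip(points, labels):
        prediction = 1 if w1 * x1 + w2 * x2 + b >= 0 else 0
        if prediction != label:
            mistakes += 1
    return mistakes


before = count_mistakes(1.0, 0.0, -2.5)
after = count_mistakes(1.0, 0.0, -2.5 + 0.001)  # nudge the line a tiny bit
print(before, after)  # prints: 2 2
```

The tiny nudge changes nothing, so the count gives us no slope to follow.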
Building a continuous error function
So, here are six points, four of them correctly classified and two of them incorrectly classified. The way we are going to construct this error function is to assign a large penalty to the two incorrectly classified points and small penalties to the correctly classified points. The sum of these penalties will be our error.
The penalty is roughly the distance from the boundary when the point is misclassified and almost zero when the point is correctly classified.
Now we can decrease the error in small amounts because we can make tiny changes to the line and see if it has reduced the error.
So, we need to build an error function that has this property.
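Here is one possible sketch of an error with this property. The exact formula has not been given at this point, so the softplus penalty below is only an assumed stand-in with the right shape: it grows roughly like the distance for a misclassified point and shrinks toward zero for a correctly classified one. The points, labels, and lines are hypothetical.

```python
import math

# Hypothetical 2-D points: label 1 = blue, label 0 = red.
points = [(1.0, 1.0), (2.0, 0.5), (3.5, 1.0), (3.0, 3.0), (4.0, 2.5), (1.5, 4.0)]
labels = [0, 0, 0, 1, 1, 1]


def penalty(d):
    """Softplus of -d, i.e. log(1 + exp(-d)), an assumed penalty shape.

    d is the signed distance to the line, positive when the point is on
    its correct side. The result is about |d| for very negative d and
    close to 0 for large positive d. Written to avoid overflow.
    """
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))


def continuous_error(w1, w2, b):
    """Sum of the penalties of all points for the line w1*x1 + w2*x2 + b = 0."""
    norm = math.hypot(w1, w2)
    total = 0.0
    for (x1, x2), label in zip(points, labels):
        distance = (w1 * x1 + w2 * x2 + b) / norm  # positive on the blue side
        signed = distance if label == 1 else -distance
        total += penalty(signed)
    return total


before = continuous_error(1.0, 0.0, -2.5)
after = continuous_error(1.0, 0.01, -2.5)  # tilt the line very slightly
print(f"before: {before:.4f}, after: {after:.4f}")
```

With this kind of error, a slight tilt of the line produces a slightly different error, so we can tell whether the change helped.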
Discrete vs Continuous Predictions
So far, we have seen that in order for us to use gradient descent, our error function must be continuous. But there is one other requirement: our predictions should also move from discrete to continuous.
A prediction is basically the answer we get from the algorithm.
A discrete prediction will be either 0 or 1, whereas a continuous prediction will be a probability between 0 and 1.
Consider the admissions use case. For discrete predictions, it is 1 for accepted and 0 for rejected. In continuous predictions, by contrast, we have the probability of getting accepted. As you can see, the probability is a function of the distance from the line.
Moving from Discrete to Continuous predictions
The way we move from discrete to continuous predictions is to simply change our activation function, from the step function on the left to the sigmoid function on the right.
The sigmoid function is simply a function which, for large positive numbers, gives us values very close to 1, and for large negative numbers gives us values very close to 0. For numbers that are close to 0, it gives us values close to 0.5.
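To make the difference concrete, here is a small sketch of both activation functions. It uses the standard definition of the sigmoid, sigmoid(x) = 1 / (1 + e^(-x)); the function names and the sample inputs are just my choices for the illustration.

```python
import math


def step(x):
    """Discrete activation: 1 for non-negative input, 0 otherwise."""
    return 1 if x >= 0 else 0


def sigmoid(x):
    """Continuous activation, 1 / (1 + e^(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"x = {x:6.1f}   step = {step(x)}   sigmoid = {sigmoid(x):.4f}")
```

The step function jumps from 0 to 1 at zero, while the sigmoid passes smoothly through 0.5 on its way from nearly 0 to nearly 1.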
So, with discrete predictions our model consisted only of a single line that divided the plane into a positive and a negative region. But with continuous predictions our model consists of an entire probability space. In other words, for each point in the plane, we are given the probability that the point is blue.
For example, consider the point on the plane. For this point, the probability of being blue is 0.4, and since a point is either blue or red, the probability of being red is 1 - 0.4 = 0.6.
The way we obtain this probability space is simple. We just combine the linear function WX + b with the sigmoid function. So, on the left we have the lines where WX + b equals -1, 0, 1, and so on. Once we apply the sigmoid function to each of these values in the plane, we obtain a number from 0 to 1 for each point. These numbers are just the probabilities of the point being blue.
The probability of the point being blue is the model’s prediction, y-hat.
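Here is a minimal sketch of how the linear function and the sigmoid combine into a probability. The weights, the bias, and the three sample points are hypothetical, chosen so that the points land on the lines where WX + b is -1, 0, and 1.

```python
import math


def sigmoid(x):
    """1 / (1 + e^(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def probability_blue(weights, bias, point):
    """y-hat: the sigmoid applied to the linear score WX + b."""
    score = sum(w * x for w, x in zip(weights, point)) + bias
    return sigmoid(score)


weights = (1.0, 1.0)  # hypothetical W
bias = -4.0           # hypothetical b

for point in [(1.0, 2.0), (2.0, 2.0), (3.0, 2.0)]:
    p_blue = probability_blue(weights, bias, point)
    print(f"point {point}: P(blue) = {p_blue:.4f}, P(red) = {1.0 - p_blue:.4f}")
```

A point exactly on the line WX + b = 0 gets probability 0.5, and the probability moves toward 1 or 0 as the point moves away from the line on either side.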
Perceptron with Sigmoid function
Shown here on the left is our old perceptron, whose activation function is the step function. On the right we have our new perceptron, whose activation function is the sigmoid function.
What this new perceptron does is take the inputs, multiply them by the weights of the edges, add the results together with the bias, and then apply the sigmoid function. So instead of returning 1 or 0 like earlier, it returns a value between 0 and 1.
Before, it would say whether the student got accepted or rejected. Now it says how probable it is that the student gets accepted.
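The two perceptrons can be sketched as a single function that takes the activation as a parameter. The two inputs (a test score and a grade, both scaled to the range 0 to 1), the weights, and the bias are hypothetical values for the admissions example, not numbers from the figures.

```python
import math


def step(x):
    """Old activation: a hard 0-or-1 decision."""
    return 1 if x >= 0 else 0


def sigmoid(x):
    """New activation: a value strictly between 0 and 1."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def perceptron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    score = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(score)


student = (0.9, 0.7)  # hypothetical (test score, grade)
weights = (2.0, 1.0)  # hypothetical edge weights
bias = -2.0           # hypothetical bias

accepted = perceptron(student, weights, bias, step)
probability = perceptron(student, weights, bias, sigmoid)
print(f"step perceptron: {accepted}")
print(f"sigmoid perceptron: {probability:.4f}")
```

Nothing about the weighted sum changes between the two versions. Only the last step differs, and that is what turns a yes-or-no answer into a probability.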