Maximum Likelihood and Cross Entropy
Let’s say that we have two models: one that tells me that my probability of getting accepted is 80%, and a second one that tells me that the probability is 55%. Which model is more accurate? If I get accepted, I would say that the first model is more accurate, whereas if I get rejected, I would say that the second one is, since it gave the higher probability (45%) to my rejection. Now, that’s just for me. Suppose we also include a friend; in this case, the best model will be the one that gives the higher probability to the events that actually happened to both of us, whether accepted or rejected. This method is called maximum likelihood: we pick the model that gives the existing labels the highest probability.
By maximizing the probability, we pick the best model.
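The idea can be sketched in a few lines of Python. The probabilities and outcomes below are made up for illustration; `likelihood` simply multiplies together the probability each model assigned to what actually happened.

```python
# Two hypothetical models, each assigning P(accepted) to me and a friend.
model_a = [0.8, 0.2]   # model A: I'm likely accepted, my friend is not
model_b = [0.55, 0.5]  # model B: less confident about both
outcomes = [1, 0]      # what actually happened: I was accepted, my friend was not

def likelihood(probs, outcomes):
    """Probability a model assigns to the outcomes that actually occurred."""
    result = 1.0
    for p, y in zip(probs, outcomes):
        result *= p if y == 1 else (1 - p)
    return result

print(likelihood(model_a, outcomes))  # 0.8 * 0.8 = 0.64
print(likelihood(model_b, outcomes))  # 0.55 * 0.5 = 0.275
```

Maximum likelihood picks model A here, because it assigned the higher probability to the combination of events that actually occurred.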
Understanding what maximum likelihood is through an example
Let’s consider four points, two blue and two red. Which model looks better? Clearly the model on the right, as it classifies all four points correctly.
Now let’s see why the model on the right is better from a probability perspective. Recall that our prediction, y-hat, is the probability of a point being labeled positive (blue): y-hat = P(blue). So for the points in the figure, let’s say that our model tells us that the probability of being blue is as shown:
Notice that the points in the blue region are much more likely to be blue, and the points in the red region are much less likely to be blue. Similarly, the probabilities of the points being red are shown below:
Now, our goal is to calculate the probability that the four points are of the colors they actually are; in other words, the probability that the two red points are red and the two blue points are blue. If we assume these to be independent events, then the overall probability is the product of the individual probabilities.
If we do the same thing to the model on the right, we get a higher probability. Thus, we conclude that the model on the right is better.
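As a sketch, assuming some illustrative per-point probabilities (the figure’s exact numbers aren’t reproduced here), the comparison looks like this: each list holds the probability a model assigns to each point’s actual color, and the overall probability is their product.

```python
# Probability each model assigns to each point's ACTUAL color
# (hypothetical numbers for illustration).
left_probs  = [0.6, 0.2, 0.1, 0.7]   # model on the left
right_probs = [0.7, 0.9, 0.8, 0.6]   # model on the right

def total_probability(probs):
    """Product of the independent per-point probabilities."""
    result = 1.0
    for p in probs:
        result *= p
    return result

print(total_probability(left_probs))   # ≈ 0.0084
print(total_probability(right_probs))  # ≈ 0.3024
```

The model on the right assigns a much higher probability to the points being the colors they actually are, so maximum likelihood prefers it.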
So, our new goal is to maximize the probability, and this method is known as Maximum Likelihood.
Relation between probability and the error function
A better model gives us a higher probability. Now, the question is: how do we maximize this probability? Or, in other words, how do we minimize the error? Can we define an error function using probability?
Maximizing the probability is equivalent to minimizing the error.
Problem with products
So we have our two models and we calculated the probabilities. But what happens when there are hundreds of data points? We would have to take the product over all of them, and the result would be a vanishingly small number.
To fix this, we convert the product into a sum using the logarithm, since log(ab) = log(a) + log(b). However, since the probabilities are between 0 and 1, their logarithms are negative numbers. Thus, our error function is the sum of the negatives of these logarithms, -log(Probabilities), which is called the Cross Entropy.
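A short sketch (with hypothetical probabilities) shows why the logarithm helps: the product of the probabilities is tiny, but the sum of the negative logs is a well-behaved positive number, and the two views agree, since -log of a product equals the sum of the -logs.

```python
import math

# Hypothetical per-point probabilities of the actual labels.
probs = [0.6, 0.2, 0.1, 0.7]

product = math.prod(probs)                        # a tiny number
cross_entropy = sum(-math.log(p) for p in probs)  # sum of -log(p): the cross entropy

print(product)             # ≈ 0.0084
print(cross_entropy)       # ≈ 4.78
print(-math.log(product))  # identical to cross_entropy (up to rounding)
```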
A good model has low cross entropy. Let’s see why.
If we calculate the probabilities and pair each point with the negative logarithm of its probability, we actually get an error for each point. Looking carefully at the values, we can see that the points that are misclassified have large errors and the points that are correctly classified have small errors. The reason, again, is that a correctly classified point has a probability close to 1, and the negative of its logarithm is a small value. Thus, we can think of the negatives of these logarithms as the errors at each point: correctly classified points have small errors, and misclassified points have large errors. In this manner, the Cross Entropy tells us whether a model is good or bad.
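A two-line sketch of the point-wise errors, with hypothetical probabilities: a correctly classified point (probability of its actual color near 1) gets a small -log, while a misclassified one (probability near 0) gets a large one.

```python
import math

p_correct = 0.9        # correctly classified: probability of actual color near 1
p_misclassified = 0.1  # misclassified: probability of actual color near 0

print(-math.log(p_correct))        # ≈ 0.105, a small error
print(-math.log(p_misclassified))  # ≈ 2.303, a large error
```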
So, in other words, our goal has changed from maximizing the probability to minimizing the cross entropy, in order to get from the model on the left to the model on the right.
Cross Entropy is an Error Function.
Cross Entropy is kind of a big deal. It really says the following: if I have a bunch of events (points) and a bunch of probabilities, how likely is it that those events happened, based on those probabilities? If it is very likely, the cross entropy is small; if it is unlikely, the cross entropy is large.