It's a cost function used as the loss for machine learning models. It tells us how badly the model is performing: the lower, the better.

I'm going to explain it word by word; hopefully that will make it easier to understand.

Negative: this simply means multiplying by -1. Multiplying what? The log likelihood of our model. Most machine learning frameworks only provide minimizers, but we want to maximize the probability of choosing the correct category.

We can maximize the likelihood by minimizing the negative log likelihood. There you have it: we maximize by minimizing.

It's also much easier to reason about the loss this way, since it stays consistent with the rule that loss functions approach 0 as the model gets better.

Log: as explained later, we are calculating the product of a number of things. And, if you are lucky, you remember that log(a*b) = log(a) + log(b), so the log turns that product into a sum.

Why do we want to wrap everything in a logarithm? Computers are capable of almost anything, except exact numeric representation.

After multiplying many numbers together you can lose precision if the result is too large or too small. Taking the natural log of a number like 1e-100 gives something close to -230, which is much easier for a computer to represent!

Better to add -230 than to multiply by 1e-100.
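To see the problem in action, here is a small sketch using only Python's standard library. The 400 probabilities of 0.1 are made up for illustration; the point is only that the plain product underflows to zero while the sum of logs stays perfectly representable.

```python
import math

# Hypothetical per-example probabilities: 400 predictions, each with probability 0.1.
probs = [0.1] * 400

# Multiplying directly: the true value is 1e-400, far below what a 64-bit float can hold.
product = 1.0
for p in probs:
    product *= p

# Summing the logs instead keeps the value in a comfortable range.
log_sum = sum(math.log(p) for p in probs)

print(product)  # 0.0 (underflow)
print(log_sum)  # about -921.03
```

Once the product has collapsed to 0.0 the information is gone, whereas the log sum still tells us exactly how unlikely the data was.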

You can find another example of numerical stability here https://stackoverflow.com/questions/42599498/numercially-stable-softmax

Likelihood: isn't it the same as probability? The meaning of the two words is quite similar, right? As with many things, statisticians need to be precise when defining concepts:

Likelihood refers to the chances of some calculated parameters producing some known data.

That makes sense: in machine learning we are interested in finding parameters that match the pattern inherent in the data. The data is fixed; the parameters aren't during training.
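A toy sketch may help make "fixed data, varying parameters" concrete. It assumes a coin-flip model with a single parameter, the probability of heads; the observations, the candidate values and the `likelihood` helper are all invented for illustration.

```python
# Fixed, observed data: 1 = heads, 0 = tails (made-up values for illustration).
data = [1, 1, 0, 1, 1, 0, 1, 1]

def likelihood(p_heads, observations):
    """Probability of seeing exactly these observations if P(heads) = p_heads."""
    result = 1.0
    for x in observations:
        result *= p_heads if x == 1 else (1.0 - p_heads)
    return result

# The data never changes; only the candidate parameter does.
candidates = [0.25, 0.5, 0.75]
scores = {p: likelihood(p, data) for p in candidates}
best = max(scores, key=scores.get)

print(scores)
print(best)  # 0.75
```

With 6 heads out of 8 flips, the candidate 0.75 explains the data best. Training a model does the same thing at a much larger scale: it searches for the parameters under which the known data is most likely.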

Typically a model will output a set of probabilities (like [0.1, 0.3, 0.5, 0.1]). How does that relate to the likelihood? We are using NLL as the loss and the model outputs probabilities, but we said they mean something different.

Oooook… then how do they play together? Well, to calculate the likelihood we have to use the probabilities. To continue with the example above, imagine that for some input we got the following probabilities: [0.1, 0.3, 0.5, 0.1], for 4 possible classes. If the true answer were the fourth class, as a vector [0, 0, 0, 1], the likelihood of the current state of the model producing that answer is:

0*0.1 + 0*0.3 + 0*0.5 + 1*0.1 = 0.1

NLL: -ln(0.1) = 2.3

If instead the correct category had been the third class [0, 0, 1, 0]:

0*0.1 + 0*0.3 + 1*0.5 + 0*0.1 = 0.5

NLL: -ln(0.5) = 0.69

Take a breath and look at the values obtained by using the logarithm and multiplying by -1.

You see? The better the prediction, the lower the NLL loss, exactly what we want! Other losses work the same way: the better the output, the lower the loss.
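The two calculations above can be reproduced with a few lines of plain Python. This is just a sketch: the `nll` helper is a name made up here, and it takes the index of the true class rather than a one-hot vector, which gives the same result because every other term is multiplied by 0.

```python
import math

def nll(probs, true_class):
    """Negative log likelihood of the true class under the predicted probabilities."""
    return -math.log(probs[true_class])

probs = [0.1, 0.3, 0.5, 0.1]  # model output from the example above

loss_fourth = nll(probs, 3)  # true class is the fourth one (index 3)
loss_third = nll(probs, 2)   # true class is the third one (index 2)

print(round(loss_fourth, 2))  # 2.3
print(round(loss_third, 2))   # 0.69
```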

## When to use it?

When doing classification.

## Can I use it for binary classification?

Yes, of course, but frameworks usually have their own binary classification loss functions.
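As a rough sketch of why both options agree, the snippet below compares a hand-written binary loss with the general NLL applied to a two-class probability vector. Both helpers and the value 0.8 are invented for this example; they are not taken from any particular framework.

```python
import math

def binary_nll(p_positive, label):
    """NLL for one example, where p_positive is the predicted P(label == 1)."""
    return -(label * math.log(p_positive) + (1 - label) * math.log(1.0 - p_positive))

def nll(probs, true_class):
    """General NLL: minus the log of the probability given to the true class."""
    return -math.log(probs[true_class])

p = 0.8  # hypothetical predicted probability of the positive class
two_class = [1.0 - p, p]  # the same prediction written as a 2-class distribution

same_pos = math.isclose(binary_nll(p, 1), nll(two_class, 1))
same_neg = math.isclose(binary_nll(p, 0), nll(two_class, 0))
print(same_pos, same_neg)  # True True
```

The dedicated binary losses are mostly a convenience: the model only has to output one number instead of two.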

## Can I use it for multi-label classification?

Yes, you can. Take a look at this article about the different ways to name cross entropy loss. Hold on! "Cross entropy loss". What's that? From Wikipedia:

[...] so that maximizing the likelihood is the same as minimizing the cross entropy [...]

https://en.wikipedia.org/wiki/Cross_entropy

In practice it's the same thing.
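A quick way to convince yourself is to compute both for the earlier example. The sketch below assumes the usual definition of cross entropy between a target distribution and a predicted one, with a one-hot target; `cross_entropy` is a helper written for this post, not a library function.

```python
import math

def cross_entropy(target, predicted):
    """-sum_i target_i * log(predicted_i), skipping terms whose target weight is 0."""
    return -sum(t * math.log(q) for t, q in zip(target, predicted) if t > 0)

predicted = [0.1, 0.3, 0.5, 0.1]
one_hot = [0, 0, 1, 0]  # the third class is the correct one

ce = cross_entropy(one_hot, predicted)
nll = -math.log(predicted[2])

print(math.isclose(ce, nll))  # True
```

With a one-hot target, every term except the true class drops out of the sum, so what is left is exactly the NLL.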