Negative Log Likelihood
First, understand likelihood: the likelihood is just the Joint Probability of the data given the model parameters $\theta$, but viewed as a function of $\theta$, i.e. $\mathcal{L}(\theta) = p(x_1, \dots, x_N \mid \theta)$.
One way to do Maximum Likelihood Estimation is to minimize the negative log likelihood.
That is, $\hat{\theta} = \arg\max_\theta \mathcal{L}(\theta) = \arg\min_\theta \big(-\log \mathcal{L}(\theta)\big)$.
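As a tiny concrete sketch (hypothetical coin-flip data, not from the text): for a Bernoulli model, minimizing the NLL over a grid of candidate parameters recovers the familiar closed-form MLE, the sample mean.

```python
import numpy as np

# Hypothetical data: 10 coin flips, 7 heads (1) and 3 tails (0).
data = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1])

def neg_log_likelihood(theta, x):
    """NLL of a Bernoulli(theta) model for observations x."""
    return -np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

# Brute-force the minimizer of the NLL over a grid of candidate thetas.
thetas = np.linspace(0.01, 0.99, 99)
nlls = [neg_log_likelihood(t, data) for t in thetas]
theta_hat = thetas[np.argmin(nlls)]

print(theta_hat)    # ~0.70
print(data.mean())  # 0.70 -- the closed-form MLE is exactly the sample mean
```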
Log Likelihood
Log likelihood makes the math a lot easier:
log(a*b*c) = log(a) + log(b) + log(c)
We do this because a*b*c might be an extremely small number that underflows to zero, so we add logs instead of multiplying raw probabilities.
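A quick illustration of why (made-up probabilities): the product underflows to zero even in float64, while the sum of logs is perfectly representable.

```python
import math

probs = [0.01] * 1000   # hypothetical per-sample probabilities

product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0 -- 0.01**1000 = 1e-2000 underflows even in float64

log_sum = sum(math.log(p) for p in probs)
print(log_sum)  # about -4605.17 -- no underflow, easy to work with
```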
Likelihood: $\mathcal{L}(\theta) = \prod_{i=1}^{N} p(x_i \mid \theta)$
- ❌ Not expressible as an expectation

Log-likelihood (Average): $\frac{1}{N}\sum_{i=1}^{N} \log p(x_i \mid \theta)$

Negative log-likelihood (Average): $-\frac{1}{N}\sum_{i=1}^{N} \log p(x_i \mid \theta)$
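Spelling that out (my notation, assuming i.i.d. samples): the averaged log-likelihood is exactly an expectation under the empirical data distribution $\hat{p}_{\text{data}}$, which the raw product-form likelihood is not.

$$
\frac{1}{N}\sum_{i=1}^{N} \log p(x_i \mid \theta)
= \mathbb{E}_{x \sim \hat{p}_{\text{data}}}\!\big[\log p(x \mid \theta)\big],
\qquad
\mathrm{NLL}(\theta)
= -\,\mathbb{E}_{x \sim \hat{p}_{\text{data}}}\!\big[\log p(x \mid \theta)\big]
$$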
Can we not just minimize negative likelihood instead?
Like, do we HAVE to use the log? Without the log, the loss wouldn't be consistent with the cross-entropy formula (with the log it is), but could we derive some other formula for the optimization?
- In theory yes, but in practice it makes your life 10x harder at the moment.
The issue is that you'd be optimizing a raw product of probabilities, and multiplying many probabilities together quickly runs into numerical stability issues.
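A minimal sketch of what goes wrong (hypothetical numbers, float32 tensors): the product of many probabilities, and therefore its gradient, underflows to zero, while the sum of log-probabilities gives a clean gradient.

```python
import torch

# 200 hypothetical probabilities for the correct class, each 0.5 (float32).
p = torch.full((200,), 0.5, requires_grad=True)

# Route 1: optimize the negative *product* of probabilities directly.
neg_prod = -p.prod()          # 0.5**200 underflows to 0 in float32
neg_prod.backward()
print(p.grad.abs().max())     # 0 -- the gradient has vanished, nothing to learn from

p.grad = None

# Route 2: optimize the negative *sum of logs* (the NLL).
nll = -p.log().sum()
nll.backward()
print(p.grad[:3])             # tensor([-2., -2., -2.]) -- i.e. -1/p, a healthy gradient
```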
Connection with Cross-Entropy
You will notice that the averaged NLL is actually the same as the Cross-Entropy definition $H(p, q) = -\mathbb{E}_{x \sim p}\big[\log q(x)\big]$, with $p = \hat{p}_{\text{data}}$ and $q = p_\theta$! Note that $p_\theta(x)$ and $p(x \mid \theta)$ mean the same thing (I use both notations).
In PyTorch, we have both CrossEntropyLoss and NLLLoss. They compute the same thing, but NLLLoss expects the input to already be log-probabilities (softmaxed and then logged), while CrossEntropyLoss takes raw logits.
LogSoftmax layer + NLLLoss == CrossEntropyLoss
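A quick sanity check of that equivalence with toy tensors (nothing here comes from a real model):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 3)             # batch of 4 examples, 3 classes (raw scores)
targets = torch.tensor([0, 2, 1, 2])   # class indices

# Route 1: CrossEntropyLoss consumes raw logits directly.
ce = nn.CrossEntropyLoss()(logits, targets)

# Route 2: LogSoftmax first, then NLLLoss on the log-probabilities.
log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss()(log_probs, targets)

print(ce.item(), nll.item())           # same value (up to float precision)
print(torch.allclose(ce, nll))         # True
```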
Optimization Landscape
I was also trying to understand how using log likelihoods helps with the optimization landscape.
🔧 Setup: Binary Classification Toy Example
Let's imagine a simple model that outputs a probability $p$ of a binary label being 1. Suppose the true label is $y = 1$. We'll compare two loss functions: the negative likelihood $1 - p$ and the negative log-likelihood $-\log(p)$.
Now we'll plot both loss functions as a function of the predicted probability $p$.
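A minimal sketch to reproduce the plot, assuming the two losses being compared are $1 - p$ and $-\log(p)$:

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.01, 1.0, 200)   # predicted probability of the true label (y = 1)

plt.plot(p, 1 - p, label="1 - p  (negative likelihood)")
plt.plot(p, -np.log(p), label="-log(p)  (negative log-likelihood)")
plt.xlabel("predicted probability p")
plt.ylabel("loss")
plt.legend()
plt.show()
```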
You can see that $-\log(p)$ gives a much better-behaved gradient signal: its gradient is $-1/p$, so it pushes hard when the prediction is confidently wrong (small $p$) and gently when it is nearly right, whereas the negative likelihood's gradient is a constant $-1$ everywhere.
In Robotics
See Behavior Cloning. Its loss is essentially Cross-Entropy Loss, i.e. the negative log-likelihood of the expert's actions under the policy.
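A minimal sketch of that connection, assuming a discrete-action policy (all names and shapes here are hypothetical): behavior cloning just minimizes the cross-entropy between the policy's action distribution and the expert's chosen actions.

```python
import torch
import torch.nn as nn

OBS_DIM, N_ACTIONS = 10, 6   # hypothetical observation size and action count

# A toy policy network mapping observations to action logits.
policy = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))

obs = torch.randn(32, OBS_DIM)                        # batch of observations
expert_actions = torch.randint(0, N_ACTIONS, (32,))   # expert's chosen actions

logits = policy(obs)
bc_loss = nn.CrossEntropyLoss()(logits, expert_actions)  # = NLL of expert actions
bc_loss.backward()   # gradients for a standard behavior-cloning update
```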