Negative Log Likelihood
First, understand likelihood: the likelihood is just the Joint Probability of the data given the model parameters $\theta$, but viewed as a function of $\theta$, i.e. $\mathcal{L}(\theta) = p(x_1, \dots, x_N \mid \theta)$.
One way to do Maximum Likelihood Estimation is to minimize the negative log likelihood.
That is, $\hat{\theta} = \arg\max_\theta \mathcal{L}(\theta) = \arg\min_\theta \big(-\log \mathcal{L}(\theta)\big)$.
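As a tiny concrete sketch (hypothetical coin-flip data, not from the text): for a Bernoulli model, minimizing the NLL over a grid of candidate parameters recovers the familiar closed-form MLE, the sample mean.

```python
import numpy as np

# Hypothetical data: 10 coin flips, 7 heads (1) and 3 tails (0).
data = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1])

def neg_log_likelihood(theta, x):
    """NLL of a Bernoulli(theta) model for observations x."""
    return -np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

# Brute-force the minimizer of the NLL over a grid of candidate thetas.
thetas = np.linspace(0.01, 0.99, 99)
nlls = [neg_log_likelihood(t, data) for t in thetas]
theta_hat = thetas[np.argmin(nlls)]

print(theta_hat)    # ~0.70
print(data.mean())  # 0.70 -- the closed-form MLE is exactly the sample mean
```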
Log Likelihood
Log likelihood makes the math a lot easier:
log(a*b*c) = log(a) + log(b) + log(c)
We do this because a*b*c might be an extremely small number that underflows to zero, so we add logs instead of multiplying raw probabilities.
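A quick illustration of why (made-up probabilities): the product underflows to zero even in float64, while the sum of logs is perfectly representable.

```python
import math

probs = [0.01] * 1000   # hypothetical per-sample probabilities

product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0 -- 0.01**1000 = 1e-2000 underflows even in float64

log_sum = sum(math.log(p) for p in probs)
print(log_sum)  # about -4605.17 -- no underflow, easy to work with
```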
Likelihood: $\mathcal{L}(\theta) = \prod_{i=1}^{N} p(x_i \mid \theta)$
- ❌ Not expressible as an expectation

Log-likelihood (Average): $\frac{1}{N}\sum_{i=1}^{N} \log p(x_i \mid \theta)$

Negative log-likelihood (Average): $-\frac{1}{N}\sum_{i=1}^{N} \log p(x_i \mid \theta)$
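Spelling that out (my notation, assuming i.i.d. samples): the averaged log-likelihood is exactly an expectation under the empirical data distribution $\hat{p}_{\text{data}}$, which the raw product-form likelihood is not.

$$
\frac{1}{N}\sum_{i=1}^{N} \log p(x_i \mid \theta)
= \mathbb{E}_{x \sim \hat{p}_{\text{data}}}\!\big[\log p(x \mid \theta)\big],
\qquad
\mathrm{NLL}(\theta)
= -\,\mathbb{E}_{x \sim \hat{p}_{\text{data}}}\!\big[\log p(x \mid \theta)\big]
$$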
Can we not just minimize negative likelihood instead?
Like, do we HAVE to use the log? Without the log, the loss wouldn't be consistent with the cross-entropy formula (with the log it is), but could we derive some other formula for the optimization?
- In theory yes, but in practice it makes your life 10x harder at the moment.
The issue is that you'd be optimizing a raw product of probabilities, and multiplying many probabilities together quickly runs into numerical stability issues.
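A minimal sketch of what goes wrong (hypothetical numbers, float32 tensors): the product of many probabilities, and therefore its gradient, underflows to zero, while the sum of log-probabilities gives a clean gradient.

```python
import torch

# 200 hypothetical probabilities for the correct class, each 0.5 (float32).
p = torch.full((200,), 0.5, requires_grad=True)

# Route 1: optimize the negative *product* of probabilities directly.
neg_prod = -p.prod()          # 0.5**200 underflows to 0 in float32
neg_prod.backward()
print(p.grad.abs().max())     # 0 -- the gradient has vanished, nothing to learn from

p.grad = None

# Route 2: optimize the negative *sum of logs* (the NLL).
nll = -p.log().sum()
nll.backward()
print(p.grad[:3])             # tensor([-2., -2., -2.]) -- i.e. -1/p, a healthy gradient
```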
Connection with Cross-Entropy
You will notice that the averaged NLL is actually the same as the Cross-Entropy definition $H(p, q) = -\mathbb{E}_{x \sim p}\big[\log q(x)\big]$, with $p = \hat{p}_{\text{data}}$ and $q = p_\theta$! Note that $p_\theta(x)$ and $p(x \mid \theta)$ mean the same thing (I use both notations).
In PyTorch, we have both CrossEntropyLoss and NLLLoss. They compute the same thing, but NLLLoss expects the input to already be log-probabilities (softmaxed and then logged), while CrossEntropyLoss takes raw logits.
LogSoftmax layer + NLLLoss == CrossEntropyLoss
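A quick sanity check of that equivalence with toy tensors (nothing here comes from a real model):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 3)             # batch of 4 examples, 3 classes (raw scores)
targets = torch.tensor([0, 2, 1, 2])   # class indices

# Route 1: CrossEntropyLoss consumes raw logits directly.
ce = nn.CrossEntropyLoss()(logits, targets)

# Route 2: LogSoftmax first, then NLLLoss on the log-probabilities.
log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss()(log_probs, targets)

print(ce.item(), nll.item())           # same value (up to float precision)
print(torch.allclose(ce, nll))         # True
```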
Optimization Landscape
I was also trying to understand how using log likelihoods helps with the optimization landscape.
🔧 Setup: Binary Classification Toy Example
Let's imagine a simple model that outputs a probability $p$ of a binary label being 1. Suppose the true label is $y = 1$. We'll compare two loss functions: the negative likelihood $1 - p$ and the negative log-likelihood $-\log(p)$.
Now we'll plot both loss functions as a function of the predicted probability $p$.
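A minimal sketch to reproduce the plot, assuming the two losses being compared are $1 - p$ and $-\log(p)$:

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.01, 1.0, 200)   # predicted probability of the true label (y = 1)

plt.plot(p, 1 - p, label="1 - p  (negative likelihood)")
plt.plot(p, -np.log(p), label="-log(p)  (negative log-likelihood)")
plt.xlabel("predicted probability p")
plt.ylabel("loss")
plt.legend()
plt.show()
```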
You can see that $-\log(p)$ gives a much better-behaved gradient signal: its gradient is $-1/p$, so it pushes hard when the prediction is confidently wrong (small $p$) and gently when it is nearly right, whereas the negative likelihood's gradient is a constant $-1$ everywhere.
In Robotics
See Behavior Cloning. Its loss is essentially Cross-Entropy Loss, i.e. the negative log-likelihood of the expert's actions under the policy.
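A minimal sketch of that connection, assuming a discrete-action policy (all names and shapes here are hypothetical): behavior cloning just minimizes the cross-entropy between the policy's action distribution and the expert's chosen actions.

```python
import torch
import torch.nn as nn

OBS_DIM, N_ACTIONS = 10, 6   # hypothetical observation size and action count

# A toy policy network mapping observations to action logits.
policy = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))

obs = torch.randn(32, OBS_DIM)                        # batch of observations
expert_actions = torch.randint(0, N_ACTIONS, (32,))   # expert's chosen actions

logits = policy(obs)
bc_loss = nn.CrossEntropyLoss()(logits, expert_actions)  # = NLL of expert actions
bc_loss.backward()   # gradients for a standard behavior-cloning update
```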