Cross-Entropy Loss
Cross-entropy loss measures the difference between two probability distributions, and is the standard loss for classification problems. LLMs use this loss to train on next-token prediction.
The cross-entropy between a "true" distribution $p$ and an estimated distribution $q$ is defined as:

$$H(p, q) = -\sum_{x} p(x) \log q(x)$$
Intuition
Cross-entropy penalizes confident wrong predictions catastrophically. As the predicted probability $q$ of the true class approaches zero, $-\log q \to \infty$. A mildly-wrong hedge (e.g. $q = 0.5$) costs $-\log 0.5 \approx 0.69$ nats, a confidently-wrong prediction ($q = 0.01$) costs $-\log 0.01 \approx 4.6$ nats, and being sure the true class has probability zero ($q = 0$) is infinitely bad. This is what forces models to be calibrated: they cannot cheaply bluff.
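The blow-up of $-\log q$ is easy to see numerically. A quick sketch (the probabilities here are just illustrative values):

```python
import math

# -log(q), where q is the predicted probability of the TRUE class:
# hedged mistakes are cheap, confident mistakes are catastrophic.
for q in [0.9, 0.5, 0.1, 0.01]:
    print(f"q = {q:>4}: loss = {-math.log(q):.2f} nats")
```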
Where does this log come from?
This comes from Shannon Entropy, $H(p) = -\sum_x p(x) \log p(x)$.
Shannon proved that under certain very reasonable axioms (continuity, monotonicity, and additivity over independent events), $\log$ is the only possible choice.
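As a tiny illustration of the definition above (a sketch in plain Python; the distributions are made up): a fair coin has entropy $\log 2$ nats, while a certain outcome carries no information.

```python
import math

def entropy(p):
    """Shannon entropy H(p) = -sum_x p(x) log p(x), in nats."""
    # terms with p(x) = 0 contribute 0 (lim p log p = 0), so skip them
    return -sum(px * math.log(px) for px in p if px > 0)

print(entropy([0.5, 0.5]))  # fair coin: log 2 ≈ 0.693
print(entropy([1.0, 0.0]))  # certain outcome: zero entropy
```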
For our ML classification problems, let $c$ be the correct class. We can simplify the cross-entropy equation:
- $p(x)$ vanishes into a one-hot distribution (a Kronecker delta), since we have $p(x) = \delta_{x,c}$. So we have

$$H(p, q) = -\log q(c)$$
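To see the collapse concretely, here is a small sketch (the distribution $q$ is illustrative): with a one-hot $p$, every term of the full sum except the one at $c$ drops out.

```python
import math

q = [0.1, 0.7, 0.2]   # model's predicted distribution over 3 classes
c = 1                 # index of the correct class
p = [1.0 if i == c else 0.0 for i in range(3)]  # one-hot "true" distribution

full = -sum(p[i] * math.log(q[i]) for i in range(3))  # H(p, q)
simplified = -math.log(q[c])                          # -log q(c)
print(full, simplified)  # equal: terms with p(x) = 0 contribute nothing
```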
PyTorch Cross-Entropy Loss
https://docs.pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html
- $N$ spans the minibatch dimension (i.e. there are $N$ samples in a batch). If this is already confusing to you, see a more basic example in L1 Loss.
There are 2 different formulations depending on how the target classes are given.

If you use class indices as the target:

$$l_n = -w_{y_n} \log \frac{\exp(x_{n,y_n})}{\sum_{c=1}^{C} \exp(x_{n,c})}$$

where:
- $x_n$ is the vector of logits for the $n$-th example
- $C$ is the number of classes
- $y_n$ is the correct class for the $n$-th example, $y_n \in \{0, \dots, C-1\}$
- $l_n$ is the loss for the $n$-th example
- $w_{y_n}$ is an optional weight for the class (useful for class imbalance)
- $x_{n,c}$ is the value of the logit at index $c$ for the $n$-th example

This is the Negative Log Likelihood of the correct class under the softmax distribution.

If you use class probabilities as the target:

$$l_n = -\sum_{c=1}^{C} w_c \log \frac{\exp(x_{n,c})}{\sum_{i=1}^{C} \exp(x_{n,i})} \, y_{n,c}$$

where $y_{n,c}$ is the target probability of class $c$ for the $n$-th example.
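A quick check that `torch.nn.functional.cross_entropy` (the functional form of `nn.CrossEntropyLoss`) matches the hand-rolled negative-log-softmax. The logits here are made up; `cross_entropy` averages over the batch by default:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])  # shape (N=2, C=3)
targets = torch.tensor([0, 2])            # class indices

# Built-in: log-softmax + negative log likelihood, mean over the batch.
loss = F.cross_entropy(logits, targets)

# Manual equivalent: pick -log_softmax at the target index per example.
log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(len(targets)), targets].mean()
print(loss.item(), manual.item())  # the two values agree
```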
Summary
Model outputs logits → softmax(logits) → log(softmax(logits)) → negate the entry at the true class.
The cross-entropy loss is calculated over the softmax of the logits, not over the raw logits.
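The pipeline above, written out in plain Python for a single example (a sketch with illustrative logits; subtracting the max before `exp` is the standard numerical-stability trick):

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max so exp() never overflows
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 0.5, -1.0]
true_class = 0
probs = softmax(logits)
loss = -math.log(probs[true_class])  # cross-entropy for this example
print(loss)
```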
Exercise:
- Explain how the Behavior Cloning loss is derived, and how it ends up being the Cross-Entropy Loss (which is where the log comes from).