Cross-Entropy Loss
We use this loss for Classification problems. LLMs are trained with this loss to predict the next token.
Cross-entropy measures the difference between two probability distributions.
Resources
The cross-entropy between a "true" distribution $p$ and an estimated distribution $q$ is defined as:
$$H(p, q) = -\sum_{x} p(x) \log q(x)$$
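A quick worked example with toy numbers (my own, purely for illustration): cross-entropy penalizes $q$ for putting low probability where $p$ puts high probability, and it is smallest when $q = p$.

```python
import math

# Toy distributions over 3 outcomes (values chosen only for illustration).
p = [0.7, 0.2, 0.1]   # "true" distribution
q = [0.5, 0.3, 0.2]   # estimated distribution

H_pq = -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)
print(H_pq)  # ~0.89 nats; it equals the entropy of p only when q == p
```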
Where does this log come from?
This comes from Shannon Entropy.
Shannon proved that, under certain very reasonable assumptions, the logarithm is the only possible choice.
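For intuition (a sketch with my own toy numbers, measuring information in bits, i.e. $\log_2$): the Shannon entropy $H(p) = -\sum_x p(x)\log_2 p(x)$ of a fair coin is exactly 1 bit, and it shrinks as the outcome becomes more predictable.

```python
import math

def shannon_entropy(p):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

print(shannon_entropy([0.5, 0.5]))   # fair coin   -> 1.0 bit
print(shannon_entropy([0.9, 0.1]))   # biased coin -> ~0.47 bits
```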
For our ML classification problems, let $y$ be the correct class. We can simplify the cross-entropy equation:
- $p$ vanishes into a One-Hot distribution (Kronecker Delta), since we have $p(x) = \delta_{x,y}$. So we have:
$$H(p, q) = -\log q(y)$$
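A tiny numeric check of that simplification, with made-up probabilities: with a one-hot $p$, the full sum collapses to the single term $-\log q(y)$.

```python
import math

# Made-up example: 3 classes, the correct class is y = 1.
q = [0.1, 0.7, 0.2]   # model's estimated distribution (e.g. a softmax output)
p = [0.0, 1.0, 0.0]   # one-hot "true" distribution

full_ce  = -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)
shortcut = -math.log(q[1])   # -log q(y) with y = 1

print(full_ce, shortcut)     # both ~0.357
```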
PyTorch Cross-Entropy Loss
https://docs.pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html
- $N$ spans the minibatch dimension (i.e. there are $N$ samples in a batch). If this is already confusing to you, see a more basic example in L1 Loss.
There are 2 different formulations, depending on whether the target is given as class indices or as class probabilities.
Variables
- $x_n$ is the $n$-th example
- $C$ is the number of dimensions (i.e. the number of classes)
- $y_n$ is the correct class for the $n$-th example, $y_n \in [0, C)$
- $l_n$ is the loss for the $n$-th example
- $w$ is an optional weight vector with one weight per class
- $x_{n,c}$ is the value of the logit at index $c$ for the $n$-th example
If you use indices as the target:
$$l_n = -w_{y_n} \log \frac{\exp(x_{n, y_n})}{\sum_{c=1}^{C} \exp(x_{n,c})}$$
where
- the term inside the $\log$ is the softmax probability of the correct class, so $l_n$ is the Negative Log Likelihood of the correct output
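A minimal sketch (the tensor values are made up) cross-checking `torch.nn.functional.cross_entropy` against the formula above for the indices-as-target case:

```python
import torch
import torch.nn.functional as F

# Made-up logits for a batch of N=2 examples and C=3 classes.
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 1.2,  0.3]])
targets = torch.tensor([0, 2])  # correct class index for each example

# Built-in loss (mean over the batch by default).
loss = F.cross_entropy(logits, targets)

# Manual version: -log softmax of the correct class, averaged over the batch.
log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(len(targets)), targets].mean()

print(loss.item(), manual.item())  # the two values should match
```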
Is this the same idea of taking the negative log as in Negative Log Likelihood?
Also see Softmax Classifier