Cross-Entropy Loss
We use this loss for Classification problems. LLMs are trained with this loss to predict the next token.
Cross-entropy measures the difference between two probability distributions.
Resources
The cross-entropy between a “true” distribution $p$ and an estimated distribution $q$ is defined as:

$$H(p, q) = -\sum_x p(x) \log q(x)$$
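A quick sanity check of this definition in PyTorch; the two distributions below are made-up toy values:

```python
import torch

# Toy "true" distribution p and estimated distribution q over 3 outcomes.
p = torch.tensor([0.7, 0.2, 0.1])
q = torch.tensor([0.6, 0.3, 0.1])

# H(p, q) = -sum_x p(x) * log q(x)
cross_entropy = -(p * q.log()).sum()
print(cross_entropy)  # gets larger the further q drifts from p
```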
Where does this log come from?
This comes from Shannon Entropy.
Shannon proved that, under a few very reasonable assumptions, the logarithm is the only possible choice.
In Classification
We can go from $-\sum_x p(x)\log q(x) \;\to\; -\log q(y_n)$
- $p(x)$ collapses into a One-Hot distribution, so only the term for the correct class $y_n$ survives (see the sketch below)
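A tiny sketch of that collapse, with made-up numbers ($q$ over three classes, correct class $y_n = 1$):

```python
import torch

# When p is one-hot at the correct class y_n, the full sum
# -sum_x p(x) log q(x) keeps only the single term -log q(y_n).
q = torch.tensor([0.1, 0.7, 0.2])   # predicted distribution over 3 classes
y_n = 1                             # index of the correct class
p = torch.nn.functional.one_hot(torch.tensor(y_n), num_classes=3).float()

full_sum = -(p * q.log()).sum()     # -sum_x p(x) log q(x)
single_term = -q[y_n].log()         # -log q(y_n)
assert torch.allclose(full_sum, single_term)
```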
From PyTorch
- $N$ spans the minibatch dimension (i.e. there are $N$ samples in a batch). If this is already confusing to you, see a more basic example in L1 Loss.
There are two different formulations, depending on how the target classes are specified.
This was really confusing to me at first. If you use class indices as the target:

$$l_n = -w_{y_n} \log \frac{\exp(x_{n, y_n})}{\sum_{c=1}^{C} \exp(x_{n,c})}$$

where:
- $w_{y_n}$ is an optional per-class weight (one entry per class), useful when you have an unbalanced training set
- $x_{n,c}$ is the value of the logit at index $c$
- $x_{n,y_n}$ is the value of the logit at the correct index $y_n$
- $l_n$ is the Negative Log Likelihood of the correct output
Is this the same idea as the negative log in Negative Log Likelihood??
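A quick check of both the formula and the NLL connection, using made-up logits; `nn.CrossEntropyLoss`, `nn.NLLLoss`, and `F.log_softmax` are the actual PyTorch APIs, the data is toy:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
N, C = 4, 3                      # N samples in the minibatch, C classes
x = torch.randn(N, C)            # logits, shape (N, C)
y = torch.tensor([0, 2, 1, 2])   # class indices as the target, shape (N,)

# Built-in cross-entropy (expects raw logits, applies softmax internally).
ce = nn.CrossEntropyLoss()(x, y)

# Same thing by hand: softmax, then -log of the probability at the correct index.
probs = F.softmax(x, dim=1)
manual = -probs[torch.arange(N), y].log().mean()

# And the NLL view: cross-entropy == NLLLoss applied to log-softmax.
nll = nn.NLLLoss()(F.log_softmax(x, dim=1), y)

assert torch.allclose(ce, manual) and torch.allclose(ce, nll)
```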
Also see Softmax Classifier
If you use probabilities for each class as the target (useful for soft labels such as label smoothing):

$$l_n = -\sum_{c=1}^{C} w_c \log \frac{\exp(x_{n,c})}{\sum_{i=1}^{C} \exp(x_{n,i})}\, y_{n,c}$$
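A sketch of this probability-target case, assuming a PyTorch version recent enough (1.10+) for `CrossEntropyLoss` to accept soft targets; the labels below are made up:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N, C = 2, 3
x = torch.randn(N, C)               # logits, shape (N, C)
y = torch.tensor([[0.8, 0.1, 0.1],  # probabilities as the target,
                  [0.2, 0.2, 0.6]]) # one distribution per sample

ce = nn.CrossEntropyLoss()(x, y)

# Same thing by hand: -sum_c y_{n,c} * log softmax(x)_{n,c}, averaged over the batch.
manual = -(y * F.log_softmax(x, dim=1)).sum(dim=1).mean()
assert torch.allclose(ce, manual)
```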