Cross-Entropy Loss

We use this loss for Classification problems. LLMs are trained with this loss to predict the next token.

Cross-entropy measures the difference between two probability distributions.

Resources

The cross-entropy between a “true” distribution $p$ and an estimated distribution $q$ is defined as:

$$H(p, q) = -\sum_x p(x)\log q(x)$$
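A minimal sketch of that formula on two small distributions, assuming PyTorch is available (the tensors are toy values, just for illustration):

```python
import torch

# A "true" distribution p and an estimated distribution q over 3 outcomes
# (toy probability vectors).
p = torch.tensor([0.7, 0.2, 0.1])
q = torch.tensor([0.5, 0.3, 0.2])

# H(p, q) = -sum_x p(x) * log q(x)
cross_entropy = -(p * q.log()).sum()
print(cross_entropy)  # ≈ 0.887

# When q == p this reduces to the entropy H(p) of the true distribution.
entropy = -(p * p.log()).sum()  # ≈ 0.802
```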

Where does this log come from?

This comes from Shannon Entropy.

Shannon proved that, under certain very reasonable assumptions, the log is the only possible choice.
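For reference, Shannon entropy measures the average information content of a single distribution $p$ (in nats, or bits if $\log_2$ is used):

$$H(p) = -\sum_x p(x)\log p(x)$$

Cross-entropy is the same expression with the distribution inside the log swapped for the estimate $q$.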

In Classification

We can go from $-\sum_x p(x)\log q(x) \to -\log q(y_n)$

  • $p(x)$ is a one-hot distribution: it is 1 at the true class $y_n$ and 0 everywhere else, so every term of the sum vanishes except $-\log q(y_n)$ (the sketch below checks this)
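A minimal sketch of that reduction, assuming PyTorch (the logits and target index are toy values):

```python
import torch
import torch.nn.functional as F

# One sample with 4 class logits and its true class index (toy values).
logits = torch.tensor([[2.0, 0.5, -1.0, 0.3]])
target = torch.tensor([1])

# Cross-entropy as computed by PyTorch (softmax + negative log inside).
loss = F.cross_entropy(logits, target)

# The one-hot reduction: -log q(y_n), where q = softmax(logits).
q = F.softmax(logits, dim=-1)
manual = -q[0, target.item()].log()

print(torch.allclose(loss, manual))  # True
```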

From PyTorch

  • $N$ spans the minibatch dimension (i.e. there are $N$ samples in a batch). If this is already confusing to you, see a more basic example in L1 Loss.

There are 2 different formulations depending on how the targets are given: class indices or class probabilities.

If you use class indices as the target:

$$\ell(x, y) = L = \{l_1,\dots,l_N\}^\top,\qquad l_n = -w_{y_n}\log\frac{\exp(x_{n,y_n})}{\sum_{c=1}^{C}\exp(x_{n,c})}$$

where

This was really confusing to me.

  • $w$ is an optional per-class weight (a vector with one entry per class), useful when you have an unbalanced training set
  • $x_{n,c}$ is the value of the logit at index $c$ for sample $n$
  • $x_{n,y_n}$ is the value of the logit at the correct index $y_n$
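A minimal numeric check of that formula, assuming PyTorch (toy logits, targets, and class weights; reduction="none" so the per-sample $l_n$ stay visible):

```python
import torch
import torch.nn.functional as F

# Two samples, three classes, and per-class weights for an unbalanced
# dataset (all values are toy).
logits = torch.tensor([[1.5, -0.2, 0.3],
                       [0.1,  2.0, -1.0]])
targets = torch.tensor([0, 2])
class_weights = torch.tensor([0.2, 0.3, 1.5])

# Per-sample losses l_n as computed by PyTorch.
losses = F.cross_entropy(logits, targets, weight=class_weights, reduction="none")

# Manual: l_n = -w_{y_n} * log( exp(x_{n,y_n}) / sum_c exp(x_{n,c}) )
log_q = F.log_softmax(logits, dim=-1)
manual = -class_weights[targets] * log_q[torch.arange(2), targets]

print(torch.allclose(losses, manual))  # True
```

Note that with the default reduction="mean" and a weight argument, PyTorch divides by the sum of the selected weights $\sum_n w_{y_n}$ rather than by $N$.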

Is this the same idea of taking the negative log as in Negative Log Likelihood?

Also see Softmax Classifier

If you use class probabilities as the target (e.g. soft labels or label smoothing):

$$l_n = -\sum_{c=1}^{C} w_c\,\log\frac{\exp(x_{n,c})}{\sum_{i=1}^{C}\exp(x_{n,i})}\; y_{n,c}$$

  • $y_{n,c}$ is the target probability of class $c$ for sample $n$
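A minimal sketch of this second form, assuming a reasonably recent PyTorch (probability targets for F.cross_entropy were added around version 1.10; values are toy):

```python
import torch
import torch.nn.functional as F

# One sample, three classes; the target is a probability distribution
# over classes instead of a single index (toy values).
logits = torch.tensor([[0.5, 1.5, -0.5]])
soft_target = torch.tensor([[0.1, 0.8, 0.1]])

loss = F.cross_entropy(logits, soft_target)

# Manual: l_n = -sum_c y_{n,c} * log softmax(x)_{n,c}
manual = -(soft_target * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

print(torch.allclose(loss, manual))  # True
```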