Softmax Classifier
The Softmax Classifier is a linear classifier whose loss interprets class scores as unnormalized log-probabilities. It's the multi-class generalization of logistic regression and the standard loss for modern classification networks.
Given linear scores $s = Wx$, the softmax turns them into a probability distribution over classes:

$$P(y = k \mid x) = \frac{e^{s_k}}{\sum_j e^{s_j}}$$

The loss is the negative log probability of the true class, i.e. the cross-entropy loss:

$$L_i = -\log\!\left(\frac{e^{s_{y_i}}}{\sum_j e^{s_j}}\right)$$

Equivalently, $L_i = -s_{y_i} + \log\sum_j e^{s_j}$.
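A minimal numpy sketch of these formulas for a single example (the function name and score values are illustrative, not from the slides):

```python
import numpy as np

def softmax_loss_single(s, y):
    """Cross-entropy loss for one example: s = raw class scores, y = true class index."""
    p = np.exp(s) / np.sum(np.exp(s))  # softmax: scores -> probabilities
    return -np.log(p[y])               # negative log-likelihood of the true class

s = np.array([3.2, 5.1, -1.7])         # illustrative raw scores
print(softmax_loss_single(s, y=0))     # ~2.04, since P(class 0) ~ 0.13
```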
Information-theoretic interpretation
Cross-entropy between the one-hot target distribution $p$ and the predicted distribution $q$ decomposes as

$$H(p, q) = -\sum_k p_k \log q_k = H(p) + D_{KL}(p \,\|\, q)$$

and $H(p) = 0$ for a one-hot $p$, so minimizing cross-entropy is exactly minimizing the KL divergence: it pushes the model's predicted distribution toward the (degenerate) target distribution.
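A quick numeric check of this identity (my own illustration, with an arbitrary predicted $q$):

```python
import numpy as np

p = np.array([0.0, 1.0, 0.0])   # one-hot target distribution
q = np.array([0.1, 0.7, 0.2])   # predicted distribution

cross_entropy = -np.sum(p * np.log(q))            # H(p, q)
mask = p > 0                                      # restrict to p's support
kl = np.sum(p[mask] * np.log(p[mask] / q[mask]))  # D_KL(p || q)
print(cross_entropy, kl)  # both ~0.357: equal because H(p) = 0 for one-hot p
```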
Sanity checks at initialization
If $W$ is initialized small ($s_j \approx 0$ for all $j$), then $e^{s_j} \approx 1$ and every class gets probability $\approx 1/C$, so:

$$L_i \approx -\log\frac{1}{C} = \log C$$

For CIFAR-10 ($C = 10$), the loss should start near $\log 10 \approx 2.302$. If your softmax loss isn't near $\log C$ on iteration 0, something is wrong (see the sketch after the bounds below).
- Min loss: $0$ (achieved only if $P(y_i \mid x_i) = 1$, requiring $s_{y_i} \to +\infty$ and all other scores $\to -\infty$; never reached)
- Max loss: $+\infty$ (as the probability assigned to the correct class $\to 0$)
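A sketch of the iteration-0 check on random data, assuming CIFAR-10 shapes; the names, init scale, and batch size are illustrative:

```python
import numpy as np

C, D, N = 10, 3073, 50                # CIFAR-10: 10 classes, 3072 pixels + bias
W = 1e-4 * np.random.randn(C, D)      # small init => all scores near 0
X = np.random.randn(D, N)
y = np.random.randint(C, size=N)

scores = W @ X                        # (C, N) class scores
P = np.exp(scores) / np.exp(scores).sum(axis=0)
loss = -np.log(P[y, np.arange(N)]).mean()
print(loss, np.log(C))                # both ~2.302
```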
Numerical stability trick
$e^{s_j}$ overflows for large scores. Subtract the max score before exponentiating; the softmax is shift-invariant:

$$\frac{e^{s_k}}{\sum_j e^{s_j}} = \frac{e^{s_k - \max_j s_j}}{\sum_j e^{s_j - \max_j s_j}}$$

This is what every framework does internally.
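A minimal numpy demonstration of why the shift matters (the scores are chosen to overflow float64):

```python
import numpy as np

s = np.array([123.0, 456.0, 789.0])

# naive: np.exp(789) overflows to inf, corrupting the result
naive = np.exp(s) / np.sum(np.exp(s))

# shifted: the largest exponent becomes e^0 = 1, nothing overflows
z = s - np.max(s)
stable = np.exp(z) / np.sum(np.exp(z))
print(naive)   # [0. 0. nan] (with an overflow warning)
print(stable)  # [~0 ~0 ~1], the distribution the naive formula defines
```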
Difference with SVM / Hinge Loss
| | SVM (multi-class hinge) | Softmax (cross-entropy) |
|---|---|---|
| What scores mean | Margins | Unnormalized log-probabilities |
| Loss = 0 condition | Correct score exceeds all others by the margin (achievable) | Never (requires $P(y_i \mid x_i) = 1$) |
| Behavior near correct | Stops caring once margin met | Always pushing |
| Output interpretation | Just rankings | Calibrated-ish probabilities |
The "probabilities" softmax outputs depend on the regularization strength: stronger regularization → smaller weights → softer (more uniform) distribution. So treat them as relative confidences, not true Bayesian probabilities.
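An illustration of this effect (values are mine): halving all scores, as smaller weights would, keeps the ranking but softens the distribution:

```python
import numpy as np

def softmax(s):
    z = s - np.max(s)                 # stability shift
    return np.exp(z) / np.sum(np.exp(z))

s = np.array([1.0, -2.0, 0.0])
print(softmax(s))        # ~[0.71, 0.04, 0.26]  sharper
print(softmax(0.5 * s))  # ~[0.55, 0.12, 0.33]  softer, same ranking
```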
Related
- Softmax Function
- Cross-Entropy Loss
- Logistic Regression
- Linear Classification
- Support Vector Machine
- Hinge Loss
- CS231n
Source
CS231n Lec 2 slides 70–83 (softmax, cross-entropy, log C sanity check, numerical stability, comparison with SVM).