Softmax Classifier

The Softmax Classifier is a linear classifier whose loss interprets class scores as unnormalized log-probabilities. It's the multi-class generalization of logistic regression and the standard loss for modern classification networks.

Given linear scores $s = f(x_i; W) = W x_i$, the softmax turns them into a probability distribution over the classes:

$$P(y = k \mid x_i) = \frac{e^{s_k}}{\sum_j e^{s_j}}$$

The loss is the negative log probability of the true class (the cross-entropy loss):

$$L_i = -\log\left(\frac{e^{s_{y_i}}}{\sum_j e^{s_j}}\right)$$

Equivalently, $L_i = -s_{y_i} + \log \sum_j e^{s_j}$.
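
As a concrete illustration, here is a minimal NumPy sketch of both steps (exponentiate-and-normalize, then the negative log of the true class's probability); the scores below are made-up values for three classes, not from the slides:

```python
import numpy as np

def softmax_cross_entropy(scores, y):
    """Cross-entropy loss L_i for a single example.

    scores: (C,) raw class scores s
    y: integer index of the true class
    """
    exp_scores = np.exp(scores)             # unnormalized probabilities e^{s_k}
    probs = exp_scores / exp_scores.sum()   # normalize to a distribution
    return -np.log(probs[y])                # L_i = -log P(y | x)

scores = np.array([3.2, 5.1, -1.7])         # illustrative scores
print(softmax_cross_entropy(scores, y=0))   # ~2.04 (probs ~ [0.13, 0.87, 0.00])
```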

Information-theoretic interpretation

Cross-entropy minimizes the KL divergence between the target one-hot distribution $p$ (all mass on the true class) and the predicted distribution $q$:

$$H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q) = D_{\mathrm{KL}}(p \,\|\, q), \quad \text{since } H(p) = 0 \text{ when } p \text{ is one-hot.}$$

Minimizing cross-entropy = pushing the model's predicted distribution toward the (degenerate) target distribution.
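
A quick numeric check of this identity, with a hypothetical predicted distribution $q$ (the epsilon guards against $\log 0$):

```python
import numpy as np

p = np.array([0.0, 1.0, 0.0])       # one-hot target: all mass on class 1
q = np.array([0.13, 0.87, 0.00])    # hypothetical predicted distribution
eps = 1e-12                         # guard against log(0)

ce = -np.sum(p * np.log(q + eps))   # H(p, q)
mask = p > 0                        # 0 * log(0) terms are taken as 0
kl = np.sum(p[mask] * np.log(p[mask] / (q[mask] + eps)))  # D_KL(p || q)
print(ce, kl)                       # both ~0.139: equal because H(p) = 0
```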

Sanity checks at initialization

If $W$ is initialized small ($s_j \approx 0$ for all $j$), then $e^{s_j} \approx 1$ and the predicted distribution is roughly uniform, so:

$$L_i \approx -\log \frac{1}{C} = \log C$$

For CIFAR-10 ($C = 10$), the loss should start near $\log 10 \approx 2.302$. If your softmax loss isn't near $\log C$ on iteration 0, something is wrong (a runnable check follows the list below).

  • Min loss: $0$ (achieved only if $P(y_i \mid x_i) = 1$, requiring $s_{y_i} \to +\infty$ and all other scores $\to -\infty$; never reached)
  • Max loss: $+\infty$ (as $P(y_i \mid x_i) \to 0$, $-\log P \to \infty$)
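
A sketch of the iteration-0 check, using random data in place of CIFAR-10 (the shapes and the 1e-4 init scale are my assumptions, not prescribed by the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
C, D, N = 10, 3073, 50                   # 10 classes, 32*32*3 pixels + bias, 50 examples
W = rng.standard_normal((C, D)) * 1e-4   # small init -> all scores near 0
X = rng.standard_normal((D, N))          # stand-in for CIFAR-10 images
y = rng.integers(0, C, size=N)           # random labels

scores = W @ X                                   # (C, N)
scores -= scores.max(axis=0, keepdims=True)      # stability shift (next section)
probs = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)
loss = -np.log(probs[y, np.arange(N)]).mean()
print(loss, np.log(C))                           # both ~2.302
```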

Numerical stability trick

$e^{s}$ overflows for large scores ($e^{710}$ already exceeds the float64 range). Subtract the max score before exponentiating; the softmax is shift-invariant:

$$\frac{e^{s_k + c}}{\sum_j e^{s_j + c}} = \frac{e^{s_k}}{\sum_j e^{s_j}} \quad \text{for any constant } c, \text{ so pick } c = -\max_j s_j$$

This is what every framework does internally.
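
A minimal demonstration of the trick (the function names are mine):

```python
import numpy as np

def softmax_unstable(s):
    e = np.exp(s)              # overflows to inf once scores exceed ~709
    return e / e.sum()

def softmax_stable(s):
    e = np.exp(s - s.max())    # shift by -max: largest exponent is exp(0) = 1
    return e / e.sum()         # same output, since softmax(s) == softmax(s + c)

s = np.array([1000.0, 1001.0, 1002.0])
print(softmax_unstable(s))     # [nan nan nan] plus overflow warnings
print(softmax_stable(s))       # [0.09 0.24 0.67]
```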

Difference with SVM / Hinge Loss

|  | SVM (multi-class hinge) | Softmax (cross-entropy) |
| --- | --- | --- |
| What scores mean | Margins | Unnormalized log-probabilities |
| Loss = 0 condition | Correct score > all others by margin (achievable) | Never (requires $P(y_i \mid x_i) = 1$) |
| Behavior near correct | Stops caring once margin is met | Always pushing |
| Output interpretation | Just rankings | Calibrated-ish probabilities |
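
The "stops caring vs. always pushing" row can be seen numerically; a sketch with made-up scores for three classes (true class 0, hinge margin 1):

```python
import numpy as np

def hinge_loss(scores, y, margin=1.0):
    # Multi-class SVM: sum over j != y of max(0, s_j - s_y + margin)
    m = np.maximum(0.0, scores - scores[y] + margin)
    m[y] = 0.0
    return m.sum()

def softmax_loss(scores, y):
    p = np.exp(scores - scores.max())
    p /= p.sum()
    return -np.log(p[y])

for s in [np.array([10.0,  9.0,  9.0]),    # margin exactly met
          np.array([10.0, -2.0,  3.0]),    # margin comfortably met
          np.array([20.0, -2.0,  3.0])]:   # even more confident
    print(hinge_loss(s, 0), softmax_loss(s, 0))
# hinge is 0 in every case; softmax loss keeps shrinking but never hits 0
```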

The “probabilities” the softmax outputs depend on the regularization strength $\lambda$: stronger regularization → smaller weights → smaller scores → a softer (more uniform) distribution. So treat them as relative confidences, not true Bayesian probabilities.
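
A sketch of this effect: scaling the scores down (as stronger regularization would, by shrinking $W$) leaves the ranking intact but flattens the distribution:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

s = np.array([1.0, -2.0, 0.0])   # made-up scores; class 0 always wins
print(softmax(2.0 * s))          # [0.88 0.00 0.12]  larger weights -> peakier
print(softmax(s))                # [0.71 0.04 0.26]
print(softmax(0.5 * s))          # [0.55 0.12 0.33]  smaller weights -> more uniform
```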

Source

CS231n Lec 2 slides 70–83 (softmax, cross-entropy, log C sanity check, numerical stability, comparison with SVM).