Hinge Loss
The hinge loss is the loss function used to train maximum-margin classifiers, most notably the SVM. It is zero once the correct prediction is “confident enough” (above a margin) and otherwise grows linearly with the violation.
Binary form
For an intended output t = ±1 and a classifier score y:

ℓ(y) = max(0, 1 − t · y)

The loss is zero when t · y ≥ 1: the score has the right sign and a magnitude of at least 1. Below that, you pay a linear penalty.
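A minimal NumPy sketch of the binary form (the function name and example scores are illustrative, not from the original):

import numpy as np

def binary_hinge(t, y):
    # t: true label in {-1, +1}, y: raw classifier score
    return np.maximum(0, 1 - t * y)

print(binary_hinge(+1, 2.0))    # 0.0: past the margin, no loss
print(binary_hinge(+1, 0.5))    # 0.5: right sign but inside the margin
print(binary_hinge(+1, -1.0))   # 2.0: wrong sign, linear penalty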
Intuition
Hinge stops caring once you are past the margin: a point correctly classified just past the margin is as good as one far beyond it; both give zero loss. Contrast with cross-entropy, which keeps pushing confidence forever. The “hinge” shape (flat then linear) means the gradient is zero for safe points and constant for unsafe ones, which is why only the margin violators (the support vectors) drive the SVM solution.
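A quick numeric sketch of that flat-then-linear shape (the scores are made up):

t = 1.0                                       # true label
for y in [-1.0, 0.0, 0.5, 1.0, 2.0, 10.0]:    # classifier scores
    loss = max(0.0, 1.0 - t * y)
    grad = 0.0 if t * y >= 1 else -t          # subgradient of the loss w.r.t. the score
    print(f"score {y:5.1f}  loss {loss:5.1f}  dloss/dscore {grad:5.1f}")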
Multi-class SVM loss
From CS231n Lec 2, given linear scores s = f(x_i, W) = W x_i and true class y_i, sum the hinge over each wrong class:

L_i = Σ_{j ≠ y_i} max(0, s_j − s_{y_i} + 1)

The “+1” is the margin, an arbitrary constant. Its value doesn’t matter because W can rescale to absorb it; what matters is its presence (so that just-correct predictions still pay a penalty).
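An unvectorized sketch of this per-example loss, for clarity (argument names mirror the vectorized version below):

def L_i_unvectorized(x, y, W):
    # x: a column of data, y: index of the correct class, W: weight matrix
    delta = 1.0                           # the margin
    scores = W.dot(x)                     # one score per class
    loss_i = 0.0
    for j in range(W.shape[0]):           # loop over all classes
        if j == y:
            continue                      # skip the correct class
        loss_i += max(0, scores[j] - scores[y] + delta)
    return loss_i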
Properties:
- Min: 0 (achieved when the correct class beats every other class by at least the margin)
- Max: unbounded; it grows linearly with the total margin violation
- At init (small W, all scores ≈ 0): each wrong class contributes max(0, 0 − 0 + 1) = 1, so L_i ≈ number of classes − 1. So for CIFAR-10, the SVM loss should start near 9, a sanity check on iteration 0 (see the sketch after this list)
- Squared hinge, max(0, s_j − s_{y_i} + 1)², is sometimes used; it penalizes large violations more aggressively
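A minimal sanity-check sketch for the value at init (assumes 10 classes and all-zero scores, as for CIFAR-10 with a tiny W):

import numpy as np

num_classes = 10
scores = np.zeros(num_classes)           # tiny W => all scores roughly 0
y = 3                                    # arbitrary true class
margins = np.maximum(0, scores - scores[y] + 1)
margins[y] = 0
print(np.sum(margins))                   # 9.0 = num_classes - 1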
Vectorized implementation
import numpy as np

def L_i_vectorized(x, y, W):
    # x: a column of data, y: index of the correct class, W: weight matrix
    scores = W.dot(x)                              # one score per class
    margins = np.maximum(0, scores - scores[y] + 1)
    margins[y] = 0                                 # don't include j = y_i in the sum
    loss_i = np.sum(margins)
    return loss_i

Non-uniqueness of the optimum
If W achieves L_i = 0 for all i, then so does 2W, 3W, etc.: the loss is invariant to positive scaling once it bottoms out. This is why you need regularization (typically L2) to break the tie and pick a “small” W.
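A toy sketch of that tie-break (numbers hand-picked so the data loss bottoms out; the regularization strength is arbitrary):

import numpy as np

x = np.array([1.0, 2.0])
y = 0                                         # true class

# Hand-picked W: the correct class beats the others by more than the margin.
W = np.array([[ 3.0,  3.0],
              [ 0.0,  0.0],
              [-1.0, -1.0]])

def svm_loss(W):
    scores = W.dot(x)
    margins = np.maximum(0, scores - scores[y] + 1)
    margins[y] = 0
    return np.sum(margins)

lam = 0.1                                     # L2 regularization strength (arbitrary)
for scale in [1.0, 2.0, 10.0]:
    Ws = scale * W
    print(scale, svm_loss(Ws), lam * np.sum(Ws ** 2))
    # data loss stays 0 at every scale; the L2 term grows, so the smallest W wins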
Difference with Softmax
Hinge stops caring once the margin is met; cross-entropy never stops pushing. In practice both work; cross-entropy is now standard because it composes cleanly with softmax and gives probabilistic outputs.
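A quick two-class illustration of that difference (the scores are made up; class 0 is taken to be the correct class):

import numpy as np

for s in [0.5, 1.0, 2.0, 5.0, 10.0]:
    scores = np.array([s, 0.0])                    # correct-class score grows, other stays at 0
    hinge = max(0.0, scores[1] - scores[0] + 1)    # multi-class hinge with margin 1
    probs = np.exp(scores) / np.sum(np.exp(scores))
    xent = -np.log(probs[0])                       # softmax cross-entropy
    print(f"score {s:4.1f}  hinge {hinge:.3f}  cross-entropy {xent:.5f}")
    # hinge hits 0 at the margin and stays there; cross-entropy keeps shrinking forever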