Supervised Learning

Support Vector Machine (SVM)

The SVM is a linear classifier that finds the max-margin separating hyperplane, giving a unique, optimal solution (unlike the perceptron). Its dual form depends only on dot products, which enables the kernel trick.

Margins: Intuition

Given a weight vector $w$ and bias $b$, the distance from a point $x_i$ to the hyperplane $w^\top x + b = 0$ is $\frac{|w^\top x_i + b|}{\|w\|}$. The margin of a dataset is the smallest such distance over all points (assuming all are correctly classified).
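
A quick numeric check (a minimal numpy sketch; the toy data, $w$, and $b$ are invented for illustration):

import numpy as np

# Hypothetical 2-D data; w, b define the hyperplane w^T x + b = 0.
X = np.array([[2.0, 3.0], [1.0, -1.0], [-2.0, 0.5]])
y = np.array([1, -1, -1])
w = np.array([1.0, 1.0])
b = -1.0

# Distance from each point to the hyperplane: |w^T x_i + b| / ||w||
dists = np.abs(X @ w + b) / np.linalg.norm(w)

# Margin of the dataset: the smallest such distance, valid only
# because every point here is correctly classified.
assert np.all(y * (X @ w + b) > 0)
margin = dists.min()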

Perceptron finds any separator. SVM picks the one with the largest margin, which generalizes better.

Geometric picture

SVM finds the widest street separating the two classes. The decision hyperplane runs down the centerline; the margin is the curb-to-curb width. A wide street is robust: small perturbations in the data do not flip a point’s side. Without the margin requirement, the solution is not unique (infinitely many hyperplanes separate the same points), so SVM imposes a principled tiebreaker.

1. Hard-Margin SVM (Primal)

Assume the data is linearly separable. Maximize the margin:

$$\min_{w,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y_i(w^\top x_i + b) \ge 1 \quad \forall i$$

Intuition

$w^\top x + b = 0$ is the equation of the hyperplane we want to learn. The sign of $w^\top x_i + b$ tells us which side $x_i$ lies on: if $w^\top x_i + b > 0$, the side $w$ points toward; if $< 0$, the opposite side. So correct classification means $y_i(w^\top x_i + b) \ge 1$ (the "$\ge 1$" rather than "$> 0$" is a scale convention: we can always rescale $w, b$ so the closest point satisfies equality).

Minimizing $\|w\|$ maximizes the margin: under the scale convention above, the closest points sit at distance exactly $\frac{1}{\|w\|}$ from the hyperplane.
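
The primal is a quadratic program, so a generic QP solver handles it directly. A minimal sketch using cvxpy (a library choice not made in the notes; the toy data is invented):

import cvxpy as cp
import numpy as np

# Hypothetical separable data.
X = np.array([[2.0, 2.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# Hard-margin primal: minimize (1/2)||w||^2 s.t. y_i (w^T x_i + b) >= 1.
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)  # the max-margin separator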

2. Soft-Margin SVM (Primal)

Real data is rarely linearly separable. Introduce slack variables $\xi_i \ge 0$ that measure how far each point violates the margin:

$$\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_i \xi_i \quad \text{s.t.}\quad y_i(w^\top x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0\ \ \forall i$$

Equivalently, in unconstrained form using the hinge loss:

$$\min_{w,b}\ \frac{1}{2}\|w\|^2 + C\sum_i \max\left(0,\ 1 - y_i(w^\top x_i + b)\right)$$

Interpreting the slack:

  • $\xi_i = 0$: on the correct side of the margin
  • $0 < \xi_i \le 1$: correctly classified but inside the margin
  • $\xi_i > 1$: misclassified

$C$ controls the tradeoff: $C \to 0$ ignores the data (only minimizes $\|w\|^2$); $C \to \infty$ recovers the hard-margin SVM.

Read $C$ as "how harshly do I punish margin violations and misclassifications versus how much do I prize a wide margin?" Small $C$: a wide margin with many points inside it. Large $C$: a narrow margin that tries hard to get every point right, possibly overfitting noise.
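
To see the tradeoff empirically, one option is to fit the same linear SVM at two settings of $C$ (a sketch using scikit-learn, which the notes do not reference; the blob data is synthetic):

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Hypothetical noisy two-class data.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

# Small C: wide margin, many points allowed inside it.
wide = SVC(kernel="linear", C=0.01).fit(X, y)
# Large C: narrow margin, tries hard to classify every point.
narrow = SVC(kernel="linear", C=100.0).fit(X, y)

# A wide, permissive margin typically touches more support vectors.
print(len(wide.support_), len(narrow.support_))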

Note the similarity to ridge regression: the $\frac{1}{2}\|w\|^2$ term is L2 regularization, and the hinge-loss sum is the data loss.
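
To make the parallel explicit (a side-by-side restatement, not an equation from the notes):

$$\text{Ridge:}\ \min_w \sum_i (y_i - w^\top x_i)^2 + \lambda \|w\|^2 \qquad \text{Soft-margin SVM:}\ \min_{w,b}\ C \sum_i \max\left(0,\ 1 - y_i(w^\top x_i + b)\right) + \frac{1}{2}\|w\|^2$$

Both have the shape (data loss) + (L2 penalty); they differ only in the loss, squared error versus hinge.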

3. Dual Formulation

Convert the constrained optimization to an unconstrained one via Lagrange multipliers. For hard-margin:

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[\, y_i(w^\top x_i + b) - 1 \,\right], \quad \alpha_i \ge 0$$

Strong duality lets us swap min and max. Setting $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$ to 0 inside gives $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$. Substituting back:

$$\max_\alpha\ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^\top x_j \quad \text{s.t.}\quad \sum_i \alpha_i y_i = 0,\ \ 0 \le \alpha_i \le C$$

(For hard-margin, drop the upper bound $\alpha_i \le C$; for soft-margin, keep it.)

Key observation

The dual depends on the data only through the dot products $x_i^\top x_j$. Replace $x_i^\top x_j$ with a kernel $K(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$ and you can operate in (possibly infinite-dimensional) feature spaces for free.
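
Concretely, training only ever needs the Gram matrix of pairwise similarities, so kernelizing is a one-line swap (a numpy sketch; the RBF choice and $\gamma$ are illustrative):

import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    # K(x, z) = exp(-gamma * ||x - z||^2), computed for all pairs.
    sq = (np.sum(X1**2, axis=1)[:, None]
          + np.sum(X2**2, axis=1)[None, :]
          - 2 * X1 @ X2.T)
    return np.exp(-gamma * sq)

X = np.random.randn(5, 3)
K_linear = X @ X.T        # Gram matrix of plain dot products x_i^T x_j
K_rbf = rbf_kernel(X, X)  # swap in a kernel: implicit high-dim features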

Support Vectors

Complementary slackness (a KKT condition) implies $\alpha_i \left[\, y_i(w^\top x_i + b) - 1 \,\right] = 0$ for each $i$, so either:

  • $\alpha_i = 0$ (point is on the correct side of the margin, doesn't contribute), or
  • $\alpha_i > 0$ (point lies on the margin or violates it: a support vector)

The solution is a sum over support vectors:

$$w = \sum_{i:\ \alpha_i > 0} \alpha_i y_i x_i$$

Only support vectors define the hyperplane, a form of sparsity.
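
At inference time this sparsity is all you need to keep around (a sketch; sv_X, sv_y, sv_alpha, and b stand for hypothetical trained values):

import numpy as np

def svm_predict(x, sv_X, sv_y, sv_alpha, b, kernel):
    # f(x) = sum over support vectors of alpha_i y_i K(x_i, x) + b;
    # every other training point had alpha_i = 0 and was discarded.
    k = np.array([kernel(x_i, x) for x_i in sv_X])
    return np.sign(np.sum(sv_alpha * sv_y * k) + b)

linear = lambda a, c: a @ c  # e.g. the plain dot-product kernel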

Intuition

Only the points right on the curb (support vectors) matter. Delete a point deep inside its class’s territory and the solution does not change. This is why SVMs are memory-efficient at inference (you only keep the support vectors) and why they are less sensitive to outliers that sit far from the boundary, though close-to-boundary outliers can still hijack the fit.

Optimization

  • Primal: (sub)gradient descent on the hinge-loss form. Subgradient of the hinge term w.r.t. $w$: $-y_i x_i$ if $y_i(w^\top x_i + b) < 1$, $0$ otherwise. Non-differentiable exactly at the hinge (see the sketch after this list)
  • Dual: projected gradient descent, or SMO (sequential minimal optimization, what LIBSVM uses). Often more practical, especially with kernels
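
A minimal sketch of the primal route, assuming plain stochastic subgradient steps (hyperparameters are made up):

import numpy as np

def train_svm_sgd(X, y, C=1.0, lr=1e-3, epochs=100):
    # Subgradient descent on (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + b)).
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            # Hinge subgradient w.r.t. w: -y_i x_i if the margin is violated, else 0.
            if y[i] * (X[i] @ w + b) < 1:
                w -= lr * (w - C * y[i] * X[i])
                b += lr * C * y[i]
            else:
                w -= lr * w  # only the regularizer contributes
    return w, b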

CS231n: Multiclass SVM Loss

For an example $x_i$ with correct class $y_i$:

$$L_i = \sum_{j \ne y_i} \max(0,\ s_j - s_{y_i} + \Delta)$$

where:

  • $s_j$ is the score for class $j$
  • $\Delta$ (typically 1) is the minimum margin

Full loss: the average over all $N$ training examples, $L = \frac{1}{N}\sum_{i=1}^{N} L_i$. Per-example computation, vectorized over classes:

import numpy as np

def L_i_vectorized(x, y, W):
    # Multiclass SVM loss for one example.
    # x: (D,) feature vector, y: integer class label, W: (C, D) weight matrix.
    delta = 1.0
    scores = W.dot(x)                                    # (C,) class scores
    margins = np.maximum(0, scores - scores[y] + delta)  # hinge against every class
    margins[y] = 0                                       # correct class contributes no loss
    return np.sum(margins)
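
One way to vectorize over the whole dataset at once (a sketch, not from the notes; the column-major shapes are assumptions):

def L_vectorized(X, y, W, delta=1.0):
    # Multiclass SVM loss averaged over all N examples.
    # X: (D, N) data as columns, y: (N,) integer labels, W: (C, D) weights.
    scores = W.dot(X)                                  # (C, N) class scores
    correct = scores[y, np.arange(X.shape[1])]         # (N,) true-class scores
    margins = np.maximum(0, scores - correct + delta)  # broadcast hinge
    margins[y, np.arange(X.shape[1])] = 0              # zero out the true class
    return margins.sum() / X.shape[1]                  # average over examples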

From CS480 lec5.