Support Vector Machine (SVM)
The SVM is a linear classifier that finds the max-margin separating hyperplane, giving a unique, optimal solution (unlike perceptron). Its dual form depends only on dot products, which enables the kernel trick.
Resources:
- https://www.quarkml.com/2022/09/primal-formulation-of-svm-simplified.html
- StatQuest: https://www.youtube.com/watch?v=efR1C6CvhmE&ab_channel=StatQuestwithJoshStarmer
Margins: Intuition
Given weight vector $w$ and bias $b$, the distance from a point $x_i$ to the hyperplane $w^\top x + b = 0$ is $\frac{|w^\top x_i + b|}{\|w\|}$. The margin of a dataset is the smallest such distance over all points (assuming all are correctly classified).
Perceptron finds any separator. SVM picks the one with the largest margin, which generalizes better.
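As a sketch (helper name hypothetical), the margin of a labeled dataset under a candidate hyperplane can be computed directly from the formula above:

```python
import numpy as np

def margin(X, y, w, b):
    """Smallest signed distance y_i * (w . x_i + b) / ||w|| over all points.

    Positive only when every point is correctly classified.
    """
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

# Toy separable data: class +1 above the line x2 = 0, class -1 below
X = np.array([[0.0, 2.0], [1.0, 3.0], [0.0, -2.0], [1.0, -3.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([0.0, 1.0]), 0.0
print(margin(X, y, w, b))  # closest points sit at |x2| = 2, so margin = 2.0
```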
Geometric picture
SVM finds the widest street separating the two classes. The decision hyperplane runs down the centerline; the margin is the curb-to-curb width. A wide street is robust: small perturbations in the data do not flip a point's side. Without the margin requirement, the solution is not unique (infinitely many hyperplanes separate the same points), so SVM imposes a principled tiebreaker.
1. Hard-Margin SVM (Primal)
Assume the data is linearly separable. Maximize the margin:

$$\min_{w, b} \ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^\top x_i + b) \ge 1 \ \forall i$$
Intuition
$w^\top x + b = 0$ is the equation of the hyperplane we want to learn. The sign of $w^\top x_i + b$ tells us which side $x_i$ lies on: if $w^\top x_i + b > 0$, the same side as $w$; if $< 0$, the opposite side. So correct classification means $y_i(w^\top x_i + b) \ge 1$ (the "$\ge 1$" is a scale convention).
Minimizing $\|w\|$ maximizes the margin $\frac{1}{\|w\|}$.
2. Soft-Margin SVM (Primal)
Real data is rarely linearly separable. Introduce slack variables $\xi_i \ge 0$ that measure how far each point violates the margin:

$$\min_{w, b, \xi} \ \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad y_i(w^\top x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 \ \forall i$$

Equivalently, in unconstrained form using hinge loss:

$$\min_{w, b} \ \frac{1}{2}\|w\|^2 + C \sum_i \max\big(0,\ 1 - y_i(w^\top x_i + b)\big)$$
Interpreting the slack:
- $\xi_i = 0$: on correct side of margin
- $0 < \xi_i \le 1$: correctly classified but inside margin
- $\xi_i > 1$: misclassified

$C$ controls the tradeoff: $C \to 0$ ignores the data (only minimizes $\|w\|^2$); $C \to \infty$ recovers hard-margin SVM.
Read $C$ as "how much do I punish misclassifications versus margin violations?" Small $C$: a wide margin with many points inside it. Large $C$: a narrow margin that tries hard to get every point right, possibly overfitting noise.
Note the similarity to ridge regression: the $\|w\|^2$ term is L2 regularization, and the hinge-loss sum is the data loss.
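A minimal sketch (helper name hypothetical) of evaluating the soft-margin objective $\frac{1}{2}\|w\|^2 + C\sum_i \max(0,\ 1 - y_i(w^\top x_i + b))$ at a given $(w, b)$:

```python
import numpy as np

def soft_margin_objective(X, y, w, b, C=1.0):
    # 0.5 * ||w||^2 (regularizer)  +  C * sum of hinge losses (data loss)
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * (w @ w) + C * np.sum(hinge)

X = np.array([[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]])
y = np.array([1, -1, -1])            # third point is on the wrong side
w, b = np.array([1.0, 0.0]), 0.0
# margins y_i(w.x_i + b) = [2, 2, -0.5] -> hinge = [0, 0, 1.5]
print(soft_margin_objective(X, y, w, b, C=1.0))  # 0.5 + 1.5 = 2.0
```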
3. Dual Formulation
Convert constrained optimization to unconstrained via Lagrange multipliers $\alpha_i \ge 0$. For hard-margin:

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \big(y_i(w^\top x_i + b) - 1\big)$$

Strong duality lets us swap min and max. Setting $\partial L / \partial w$ and $\partial L / \partial b$ to 0 inside gives $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$. Substituting back:

$$\max_{\alpha} \ \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \, x_i^\top x_j \quad \text{s.t.} \quad \sum_i \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C$$

(For hard-margin, drop the upper bound $\alpha_i \le C$; for soft-margin, keep it.)
Key observation
The dual depends on the data only through dot products $x_i^\top x_j$. Replace $x_i^\top x_j$ with a kernel $K(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$ and you can operate in (infinite-dimensional) feature spaces for free.
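A minimal sketch of that substitution (helper names hypothetical): the linear Gram matrix of dot products can be swapped for an RBF Gram matrix without ever forming the feature map $\phi$:

```python
import numpy as np

def linear_gram(X):
    return X @ X.T  # entries are the dot products x_i . x_j

def rbf_gram(X, gamma=1.0):
    # K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2): dot products in an
    # infinite-dimensional feature space, computed without visiting it
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * (X @ X.T)
    return np.exp(-gamma * d2)

X = np.array([[0.0, 0.0], [1.0, 0.0]])
print(linear_gram(X))          # [[0, 0], [0, 1]]
print(rbf_gram(X, gamma=1.0))  # diagonal = 1, off-diagonal = exp(-1)
```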
Support Vectors
Complementary slackness (a KKT condition) implies $\alpha_i \big(y_i(w^\top x_i + b) - 1\big) = 0$ for each $i$, so either:
- $\alpha_i = 0$ (point is on the correct side of the margin, doesn't contribute), or
- $y_i(w^\top x_i + b) = 1$ with $\alpha_i > 0$ (point lies on the margin or violates it, a support vector)

The solution is a sum over support vectors:

$$w = \sum_{i \,:\, \alpha_i > 0} \alpha_i y_i x_i$$
Only support vectors define the hyperplane, a form of sparsity.
Intuition
Only the points right on the curb (support vectors) matter. Delete a point deep inside its class's territory and the solution does not change. This is why SVMs are memory-efficient at inference (you only keep the support vectors) and why they are less sensitive to outliers that sit far from the boundary, though close-to-boundary outliers can still hijack the fit.
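A sketch of inference using only the support vectors, via $f(x) = \sum_i \alpha_i y_i K(x_i, x) + b$ (the dual coefficients $\alpha_i$ are assumed already solved for; names are hypothetical):

```python
import numpy as np

def decision(x, sv_X, sv_y, sv_alpha, b, kernel):
    # f(x) = sum_i alpha_i * y_i * K(x_i, x) + b, summed over support vectors only
    return sum(a * yi * kernel(xi, x)
               for a, yi, xi in zip(sv_alpha, sv_y, sv_X)) + b

dot = lambda u, v: float(np.dot(u, v))  # linear kernel

# Two support vectors on the margins of the 1-D problem x = +/-1
sv_X = [np.array([1.0]), np.array([-1.0])]
sv_y = [1, -1]
sv_alpha = [0.5, 0.5]  # implies w = sum alpha_i y_i x_i = [1.0]
b = 0.0
print(decision(np.array([2.0]), sv_X, sv_y, sv_alpha, b, dot))  # 2.0 > 0 -> class +1
```

Swapping `dot` for an RBF kernel changes nothing else in the inference path.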
Optimization
- Primal: gradient descent on the hinge-loss form. Subgradient of the hinge term for point $i$: $-y_i x_i$ if $y_i(w^\top x_i + b) < 1$, $0$ otherwise. Non-differentiable at the hinge
- Dual: projected gradient descent, or SMO (sequential minimal optimization, what LIBSVM uses). Often more practical, especially with kernels
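The primal route can be sketched as stochastic subgradient descent on the hinge-loss objective (step size, epoch count, and loop structure are illustrative choices, not what LIBSVM does):

```python
import numpy as np

def train_svm_sgd(X, y, C=1.0, lr=0.01, epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            # Subgradient of 0.5||w||^2 + C * hinge at example i
            if y[i] * (X[i] @ w + b) < 1:   # inside margin or misclassified
                w -= lr * (w - C * y[i] * X[i])
                b += lr * C * y[i]
            else:                            # hinge inactive: only the regularizer pulls
                w -= lr * w
    return w, b

X = np.array([[2.0, 1.0], [3.0, 2.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1, 1, -1, -1])
w, b = train_svm_sgd(X, y)
print(np.sign(X @ w + b))  # should match y on this toy separable set
```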
CS231n: Multiclass SVM Loss
For example $x_i$ with correct class $y_i$:

$$L_i = \sum_{j \ne y_i} \max\big(0,\ s_j - s_{y_i} + \Delta\big)$$

where:
- $s_j$ is the score for class $j$
- $\Delta$ (typically 1) is the minimum margin

Full loss: $L = \frac{1}{N} \sum_i L_i$ (plus a regularization term in practice).
import numpy as np

def L_i_vectorized(x, y, W):
    delta = 1.0
    scores = W.dot(x)
    margins = np.maximum(0, scores - scores[y] + delta)
    margins[y] = 0  # the correct class contributes no loss
    return np.sum(margins)

From CS480 lec5.
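A quick sanity check of this loss on toy numbers (the function is restated so the snippet is self-contained; the weights and scores are illustrative):

```python
import numpy as np

def L_i_vectorized(x, y, W):
    delta = 1.0
    scores = W.dot(x)
    margins = np.maximum(0, scores - scores[y] + delta)
    margins[y] = 0
    return np.sum(margins)

# 3 classes, 2 features; rows of W are per-class score templates
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])
x = np.array([2.0, 0.0])          # scores = W.dot(x) = [2, 0, 1]
print(L_i_vectorized(x, 0, W))    # class 0 beats both others by >= delta: loss 0.0
print(L_i_vectorized(x, 1, W))    # class 1 loses badly: max(0,2-0+1) + max(0,1-0+1) = 5.0
```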
Related
- Perceptron (finds any separator; SVM picks max-margin)
- Hinge Loss
- Kernel Method (key extension, enables nonlinear SVMs)
- Hyperplane
- Lagrange Multipliers
- Regularization
- CS480