Kernel Method
The kernel method lets linear algorithms (like SVM) operate in very high (even infinite-dimensional) feature spaces without ever explicitly computing the feature map. Many linear algorithms depend on the data only through dot products, and a kernel computes those dot products in time independent of the feature-space dimension.
What's the kernel trick buying us?
Nonlinear decision boundaries at linear-algorithm prices: skip the explicit high-dimensional lift and still get its expressive power.
Motivation: Nonlinear Decision Boundaries
Linear methods fail when the true boundary is curved (e.g., a circle, or XOR). Lift the data into a higher-dimensional space where it becomes linearly separable.
XOR
The map $\phi(x_1, x_2) = (x_1, x_2, x_1 x_2)$ makes XOR separable: the product coordinate $x_1 x_2$ captures the interaction.
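A minimal NumPy sketch of this lift; the $\pm 1$ encoding of XOR and the explicit separating hyperplane are illustrative choices, not from the lecture:

```python
import numpy as np

# XOR with +/-1 coordinates: the label is determined by the product x1*x2,
# which no linear classifier on (x1, x2) alone can recover.
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]])
y = np.array([-1, 1, 1, -1])  # XOR pattern

def lift(X):
    # Append the interaction term x1*x2 as a third coordinate.
    return np.column_stack([X, X[:, 0] * X[:, 1]])

Z = lift(X)
# In the lifted space, the hyperplane w = (0, 0, -1), b = 0 separates perfectly.
w = np.array([0.0, 0.0, -1.0])
preds = np.sign(Z @ w)
print(preds)  # matches y exactly: [-1.  1.  1. -1.]
```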
Feature Maps
A feature map $\phi: \mathbb{R}^d \to \mathbb{R}^D$ sends $x$ to $\phi(x)$, where $D \gg d$ (possibly $D = \infty$).
Quadratic feature map (example): for $x \in \mathbb{R}^d$, take $\phi(x) = (x_i x_j)_{1 \le i, j \le d} \in \mathbb{R}^{d^2}$. Now a linear classifier in $\phi$-space captures quadratic boundaries.
The issue: computing explicitly for high-order polynomial or Gaussian feature maps is expensive or infeasible.
The Kernel Trick
A function $K: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is a kernel if there exists some feature map $\phi$ such that $K(x, z) = \phi(x)^\top \phi(z)$ for all $x, z$.
Crucially, $\phi$ may be expensive (or infinite-dimensional) but $K$ may be cheap to evaluate.
Quadratic kernel example: $K(x, z) = (x^\top z)^2 = \left(\sum_i x_i z_i\right)^2 = \sum_{i,j} (x_i x_j)(z_i z_j) = \phi(x)^\top \phi(z)$
where:
- $\phi(x) = (x_i x_j)_{1 \le i, j \le d}$ is the quadratic feature map
Computing the kernel takes $O(d)$ time instead of $O(d^2)$ for the explicit map.
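A quick numerical check of the identity $(x^\top z)^2 = \phi(x)^\top \phi(z)$, with $\phi$ taken as all $d^2$ pairwise products (a sketch; the dimension and random seed are arbitrary):

```python
import numpy as np

d = 5
rng = np.random.default_rng(0)
x, z = rng.standard_normal(d), rng.standard_normal(d)

def phi(v):
    # Explicit quadratic feature map: all d^2 pairwise products v_i * v_j.
    return np.outer(v, v).ravel()

explicit = phi(x) @ phi(z)   # O(d^2) work in the lifted space
kernel = (x @ z) ** 2        # O(d) work in the original space
assert np.isclose(explicit, kernel)
```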
Intuition
You never materialize the high-dimensional feature vector. The kernel evaluates the inner product in the implicit space directly, so you get a decision boundary that is curved in the original space without paying the cost of explicit features. For the RBF kernel, $\phi$ is literally infinite-dimensional, and yet $K$ is a one-line exponential. Any algorithm that touches data only through dot products gets this upgrade for free.
Common Kernels
- Linear: $K(x, z) = x^\top z$ (no lifting)
- Polynomial (degree $p$): $K(x, z) = (x^\top z + c)^p$
- Gaussian / Radial Basis Function (RBF): $K(x, z) = \exp\left(-\dfrac{\|x - z\|^2}{2\sigma^2}\right)$, corresponds to an infinite-dimensional feature map
The RBF kernel reads as a similarity score: 1 when $x = z$, decaying smoothly to 0 as the points move apart. $\sigma$ is the bandwidth, controlling how quickly similarity drops. Small $\sigma$ means every point is "an island" (overfits); large $\sigma$ smooths everything toward linear.
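The three kernels as NumPy one-liners; this is a sketch, and the parameter defaults ($p$, $c$, $\sigma$) are illustrative:

```python
import numpy as np

def linear(x, z):
    return x @ z

def poly(x, z, p=2, c=1.0):
    return (x @ z + c) ** p

def rbf(x, z, sigma=1.0):
    # Bandwidth sigma controls how fast similarity decays with distance.
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

x = np.array([1.0, 0.0])
print(rbf(x, x))          # identical points -> similarity 1.0
print(rbf(x, x + 100.0))  # far apart -> underflows to 0.0
```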
Valid Kernels: Mercer's Condition
A function $K$ is a valid kernel iff, for any finite dataset $\{x_1, \dots, x_n\}$, the Gram matrix $G \in \mathbb{R}^{n \times n}$ with $G_{ij} = K(x_i, x_j)$ is
- Symmetric: $G = G^\top$
- Positive semidefinite (PSD): $c^\top G c \ge 0$ for all $c \in \mathbb{R}^n$
PSD-ness follows from writing $G = \Phi \Phi^\top$ where $\Phi$ has rows $\phi(x_i)^\top$: then $c^\top G c = \|\Phi^\top c\|^2 \ge 0$.
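A sanity check of Mercer's condition on random data, using the RBF kernel; the dataset size and numerical tolerance are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 3))

def rbf(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

# Gram matrix G_ij = K(x_i, x_j).
G = np.array([[rbf(a, b) for b in X] for a in X])

assert np.allclose(G, G.T)   # symmetric
eigs = np.linalg.eigvalsh(G)
assert eigs.min() > -1e-10   # PSD up to floating-point round-off
```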
Kernelized SVM
Recall the SVM dual depends only on dot products $x_i^\top x_j$:
$$\max_{\alpha} \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j \quad \text{s.t.} \quad 0 \le \alpha_i \le C, \;\; \sum_i \alpha_i y_i = 0$$
Replace $x_i^\top x_j$ with $K(x_i, x_j)$:
$$\max_{\alpha} \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, K(x_i, x_j)$$
Predicting a new point $x$: we don't need $w$ explicitly:
$$\hat{y} = \operatorname{sign}\left(\sum_{i \in \mathrm{SV}} \alpha_i y_i \, K(x_i, x) + b\right)$$
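A sketch of dual-form prediction; the support vectors, dual coefficients $\alpha$, and bias $b$ below are made-up stand-ins for a fitted solver's output, not real training results:

```python
import numpy as np

# Hypothetical fitted quantities (assumed, not produced by a real solver).
sv = np.array([[1.0, 1.0], [-1.0, -1.0]])  # support vectors
sv_y = np.array([1.0, -1.0])               # their labels
alpha = np.array([0.5, 0.5])               # dual coefficients
b = 0.0                                    # bias

def rbf(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def predict(x):
    # Sum over support vectors only: no explicit w or phi is ever formed.
    score = sum(a * y * rbf(s, x) for a, y, s in zip(alpha, sv_y, sv)) + b
    return np.sign(score)

print(predict(np.array([0.9, 1.1])))  # near the positive SV -> 1.0
```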
Complexity
- Linear-kernel SVM: no Gram matrix needed; precompute $w = \sum_i \alpha_i y_i x_i$ once, so test time is $O(d)$ per point
- General-kernel SVM: $O(n^2 d)$ to build the Gram matrix at training; $O(n_{\mathrm{SV}} \, d)$ test time per point (loop over support vectors)
So kernels aren't free: there's a dataset-size penalty. Worth it when $n$ is small or moderate and the true boundary is nonlinear.
From CS480 lec6.
Related
- Support Vector Machine (canonical kernel application)
- Kernel Density Estimation
- CS480