Kernel Method

The kernel method lets linear algorithms (like SVM) operate in very high (even infinite-dimensional) feature spaces without ever explicitly computing the feature map. Many linear algorithms depend on the data only through dot products, and a kernel computes those dot products in time independent of the feature-space dimension.

What's the kernel trick buying us?

Nonlinear decision boundaries at linear-algorithm prices: skip the explicit high-dimensional lift and still get its expressive power.

Motivation: Nonlinear Decision Boundaries

Linear methods fail when the true boundary is curved (e.g., a circle, or XOR). The fix: lift the data into a higher-dimensional space where it becomes linearly separable.

XOR

The map $\phi(x_1, x_2) = (x_1, x_2, x_1 x_2)$ makes XOR separable: the product coordinate $x_1 x_2$ captures the interaction.
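A minimal numeric check of this (NumPy sketch; the weights below are one hand-picked separating hyperplane in the lifted space, not a learned one):

```python
import numpy as np

# XOR labels: no line in the original (x1, x2) plane separates them
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

# Lift with the product coordinate: (x1, x2) -> (x1, x2, x1*x2)
Phi = np.column_stack([X, X[:, 0] * X[:, 1]])

# One hand-picked linear rule in the lifted space classifies XOR perfectly
w, b = np.array([1.0, 1.0, -2.0]), -0.5
pred = np.sign(Phi @ w + b)
```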

Feature Maps

A feature map $\phi : \mathbb{R}^d \to \mathbb{R}^D$ sends $x$ to $\phi(x)$, where $D \gg d$ (possibly $D = \infty$).

Quadratic feature map (example): for $x = (x_1, x_2)$, take

$$\phi(x) = \left(x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2,\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2,\; 1\right)$$

Now a linear classifier in $\phi$-space captures quadratic boundaries.

The issue: computing $\phi(x)$ explicitly for high-order polynomial or Gaussian feature maps is expensive or infeasible.

The Kernel Trick

A function $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is a kernel if there exists some feature map $\phi$ such that

$$k(x, z) = \langle \phi(x), \phi(z) \rangle \quad \text{for all } x, z.$$

Crucially, $\phi$ may be expensive (or infinite-dimensional) to compute, but $k$ may be cheap to evaluate.

Quadratic kernel example:

$$k(x, z) = (x^\top z + 1)^2 = \langle \phi(x), \phi(z) \rangle$$

where:

  • $\phi$ is the quadratic feature map from above

Computing the kernel directly takes $O(d)$ time instead of the $O(d^2)$ required by the explicit map.
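A quick sanity check of the identity $(x^\top z + 1)^2 = \langle \phi(x), \phi(z) \rangle$ (NumPy sketch; `phi_quad` is an illustrative name for the explicit $O(d^2)$-dimensional quadratic map):

```python
import numpy as np

def phi_quad(x):
    # Explicit quadratic map whose inner product equals (x.z + 1)^2:
    # all pairwise products x_i*x_j, then sqrt(2)*x_i terms, then the constant 1
    d = len(x)
    pair = [x[i] * x[j] for i in range(d) for j in range(d)]
    lin = [np.sqrt(2.0) * xi for xi in x]
    return np.array(pair + lin + [1.0])

rng = np.random.default_rng(0)
x, z = rng.standard_normal(5), rng.standard_normal(5)

k_cheap = (x @ z + 1.0) ** 2            # O(d): never builds the features
k_explicit = phi_quad(x) @ phi_quad(z)  # O(d^2) features materialized
assert np.isclose(k_cheap, k_explicit)
```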

Intuition

You never materialize the high-dimensional feature vector. The kernel evaluates the inner product in the implicit space directly, so you get a decision boundary that is curved in the original space without paying the cost of explicit features. For the RBF kernel, $\phi(x)$ is literally infinite-dimensional, and yet $k(x, z)$ is a one-line exponential. Any algorithm that touches data only through dot products gets this upgrade for free.

Common Kernels

  • Linear: $k(x, z) = x^\top z$ (no lifting)
  • Polynomial (degree $p$): $k(x, z) = (x^\top z + c)^p$
  • Gaussian / Radial Basis Function (RBF): $k(x, z) = \exp\!\left(-\frac{\|x - z\|^2}{2\sigma^2}\right)$, corresponds to an infinite-dimensional feature map

The RBF kernel reads as a similarity score: it is 1 when $x = z$ and decays smoothly to 0 as the points move apart. The bandwidth $\sigma$ controls how quickly similarity drops: small $\sigma$ means every point is β€œan island” (overfits), while large $\sigma$ smooths everything toward linear.
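The similarity-score reading can be checked directly (NumPy sketch; `rbf` is an illustrative helper name):

```python
import numpy as np

def rbf(x, z, sigma=1.0):
    # Gaussian / RBF kernel: exp(-||x - z||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

x = np.zeros(3)
near, far = x + 0.1, x + 5.0

assert rbf(x, x) == 1.0            # identical points: similarity exactly 1
assert rbf(x, near) > rbf(x, far)  # similarity decays with distance
# Bandwidth effect: small sigma collapses similarity quickly ("islands"),
# large sigma makes almost everything look similar
assert rbf(x, near, sigma=0.01) < rbf(x, near, sigma=10.0)
```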

Valid Kernels: Mercer’s Condition

A function $k$ is a valid kernel iff, for any finite dataset $\{x_1, \dots, x_n\}$, the Gram matrix $K \in \mathbb{R}^{n \times n}$ with $K_{ij} = k(x_i, x_j)$ is

  1. Symmetric: $K_{ij} = K_{ji}$
  2. Positive semidefinite (PSD): $c^\top K c \ge 0$ for all $c \in \mathbb{R}^n$

PSD-ness follows from writing $K = \Phi \Phi^\top$, where $\Phi$ has rows $\phi(x_i)^\top$: then $c^\top K c = \|\Phi^\top c\|^2 \ge 0$.
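Both conditions are easy to spot-check numerically on a known-valid kernel, here the RBF (NumPy sketch; the dataset is random and illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 4))

# Gram matrix of the RBF kernel (sigma = 1) on 20 random points
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)

assert np.allclose(K, K.T)                   # symmetric
assert np.linalg.eigvalsh(K).min() > -1e-10  # PSD up to floating-point noise
```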

Kernelized SVM

Recall the SVM dual depends only on dot products $x_i^\top x_j$:

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j \quad \text{s.t.} \quad 0 \le \alpha_i \le C, \;\; \sum_i \alpha_i y_i = 0$$

Replace $x_i^\top x_j$ with $k(x_i, x_j)$:

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, k(x_i, x_j)$$

Predicting a new point $x$: we don’t need $w = \sum_i \alpha_i y_i \phi(x_i)$ explicitly:

$$f(x) = \operatorname{sign}\!\left( \sum_i \alpha_i y_i \, k(x_i, x) + b \right)$$
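The prediction rule can be sketched in a few lines (NumPy; `svm_predict` is an illustrative helper, and the dual variables $\alpha_i$ and offset $b$ are assumed to come from a solver rather than being computed here):

```python
import numpy as np

def svm_predict(x_new, alphas, ys, Xs, b, kernel):
    # Decision rule: sign( sum_i alpha_i * y_i * k(x_i, x_new) + b ).
    # Xs only needs to hold the support vectors (points with alpha_i > 0).
    scores = np.array([kernel(xi, x_new) for xi in Xs])
    return np.sign(alphas * ys @ scores + b)

# Toy usage: a linear kernel and two hand-set support vectors
lin = lambda a, c: a @ c
Xs = np.array([[1.0, 0.0], [-1.0, 0.0]])
ys = np.array([1.0, -1.0])
alphas = np.array([1.0, 1.0])

assert svm_predict(np.array([2.0, 0.0]), alphas, ys, Xs, 0.0, lin) == 1.0
assert svm_predict(np.array([-2.0, 0.0]), alphas, ys, Xs, 0.0, lin) == -1.0
```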

Complexity

  • Linear-kernel SVM: $O(nd)$ per pass over the data at training, $O(d)$ test time (store $w$ explicitly)
  • General-kernel SVM: $O(n^2 d)$ to build the Gram matrix at training, $O(n_{\text{sv}}\, d)$ test time (loop over the $n_{\text{sv}}$ support vectors)

So kernels aren’t free: there’s a dataset-size penalty. They’re worth it when $n$ is small or moderate and the true boundary is nonlinear.

From CS480 lec6.