Kernel Method

The kernel method lets linear algorithms (like SVM) operate in very high (even infinite-dimensional) feature spaces without ever explicitly computing the feature map. Many linear algorithms depend on the data only through dot products, and a kernel computes those dot products in time independent of the feature-space dimension.

What's the kernel trick buying us?

Nonlinear decision boundaries at linear-algorithm prices: skip the explicit high-dimensional lift and still get its expressive power.

Motivation: Nonlinear Decision Boundaries

Linear methods fail when the true boundary is curved (e.g., a circle, or XOR). The fix: lift the data into a higher-dimensional space where it becomes linearly separable.

XOR

The map $\phi(x_1, x_2) = (x_1, x_2, x_1 x_2)$ makes XOR separable: the product coordinate $x_1 x_2$ captures the interaction.
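A minimal numeric check of this (NumPy sketch; the weights below are one hand-picked separating hyperplane in the lifted space, not a learned one):

```python
import numpy as np

# XOR labels: no line in the original (x1, x2) plane separates them
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

# Lift with the product coordinate: (x1, x2) -> (x1, x2, x1*x2)
Phi = np.column_stack([X, X[:, 0] * X[:, 1]])

# One hand-picked linear rule in the lifted space classifies XOR perfectly
w, b = np.array([1.0, 1.0, -2.0]), -0.5
pred = np.sign(Phi @ w + b)
```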

Feature Maps

A feature map $\phi : \mathbb{R}^d \to \mathbb{R}^D$ sends $x$ to $\phi(x)$, where $D \gg d$ (possibly $D = \infty$).

Quadratic feature map (example): for $x = (x_1, x_2)$, take

$$\phi(x) = \left(x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2,\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2,\; 1\right)$$

Now a linear classifier in $\phi$-space captures quadratic boundaries.

The issue: computing $\phi(x)$ explicitly for high-order polynomial or Gaussian feature maps is expensive or infeasible.

The Kernel Trick

A function $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is a kernel if there exists some feature map $\phi$ such that

$$k(x, z) = \langle \phi(x), \phi(z) \rangle \quad \text{for all } x, z.$$

Crucially, $\phi$ may be expensive (or infinite-dimensional) to compute, but $k$ may be cheap to evaluate.

Quadratic kernel example:

$$k(x, z) = (x^\top z + 1)^2 = \langle \phi(x), \phi(z) \rangle$$

where:

  • $\phi$ is the quadratic feature map from above

Computing the kernel directly takes $O(d)$ time instead of the $O(d^2)$ required by the explicit map.
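A quick sanity check of the identity $(x^\top z + 1)^2 = \langle \phi(x), \phi(z) \rangle$ (NumPy sketch; `phi_quad` is an illustrative name for the explicit $O(d^2)$-dimensional quadratic map):

```python
import numpy as np

def phi_quad(x):
    # Explicit quadratic map whose inner product equals (x.z + 1)^2:
    # all pairwise products x_i*x_j, then sqrt(2)*x_i terms, then the constant 1
    d = len(x)
    pair = [x[i] * x[j] for i in range(d) for j in range(d)]
    lin = [np.sqrt(2.0) * xi for xi in x]
    return np.array(pair + lin + [1.0])

rng = np.random.default_rng(0)
x, z = rng.standard_normal(5), rng.standard_normal(5)

k_cheap = (x @ z + 1.0) ** 2            # O(d): never builds the features
k_explicit = phi_quad(x) @ phi_quad(z)  # O(d^2) features materialized
assert np.isclose(k_cheap, k_explicit)
```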

Intuition

You never materialize the high-dimensional feature vector. The kernel evaluates the inner product in the implicit space directly, so you get a decision boundary that is curved in the original space without paying the cost of explicit features. For the RBF kernel, $\phi(x)$ is literally infinite-dimensional, and yet $k(x, z)$ is a one-line exponential. Any algorithm that touches data only through dot products gets this upgrade for free.

Common Kernels

  • Linear: $k(x, z) = x^\top z$ (no lifting)
  • Polynomial (degree $p$): $k(x, z) = (x^\top z + c)^p$
  • Gaussian / Radial Basis Function (RBF): $k(x, z) = \exp\!\left(-\frac{\|x - z\|^2}{2\sigma^2}\right)$, corresponds to an infinite-dimensional feature map

The RBF kernel reads as a similarity score: it is 1 when $x = z$ and decays smoothly to 0 as the points move apart. The bandwidth $\sigma$ controls how quickly similarity drops: small $\sigma$ means every point is β€œan island” (overfits), while large $\sigma$ smooths everything toward linear.
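The similarity-score reading can be checked directly (NumPy sketch; `rbf` is an illustrative helper name):

```python
import numpy as np

def rbf(x, z, sigma=1.0):
    # Gaussian / RBF kernel: exp(-||x - z||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

x = np.zeros(3)
near, far = x + 0.1, x + 5.0

assert rbf(x, x) == 1.0            # identical points: similarity exactly 1
assert rbf(x, near) > rbf(x, far)  # similarity decays with distance
# Bandwidth effect: small sigma collapses similarity quickly ("islands"),
# large sigma makes almost everything look similar
assert rbf(x, near, sigma=0.01) < rbf(x, near, sigma=10.0)
```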

Valid Kernels: Mercer’s Condition

A function $k$ is a valid kernel iff, for any finite dataset $\{x_1, \dots, x_n\}$, the Gram matrix $K \in \mathbb{R}^{n \times n}$ with $K_{ij} = k(x_i, x_j)$ is

  1. Symmetric: $K_{ij} = K_{ji}$
  2. Positive semidefinite (PSD): $c^\top K c \ge 0$ for all $c \in \mathbb{R}^n$

PSD-ness follows from writing $K = \Phi \Phi^\top$, where $\Phi$ has rows $\phi(x_i)^\top$: then $c^\top K c = \|\Phi^\top c\|^2 \ge 0$.
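Both conditions are easy to spot-check numerically on a known-valid kernel, here the RBF (NumPy sketch; the dataset is random and illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 4))

# Gram matrix of the RBF kernel (sigma = 1) on 20 random points
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)

assert np.allclose(K, K.T)                   # symmetric
assert np.linalg.eigvalsh(K).min() > -1e-10  # PSD up to floating-point noise
```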

Kernelized SVM

Recall the SVM dual depends only on dot products $x_i^\top x_j$:

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j \quad \text{s.t.} \quad 0 \le \alpha_i \le C, \;\; \sum_i \alpha_i y_i = 0$$

Replace $x_i^\top x_j$ with $k(x_i, x_j)$:

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, k(x_i, x_j)$$

Predicting a new point $x$: we don’t need $w = \sum_i \alpha_i y_i \phi(x_i)$ explicitly:

$$f(x) = \operatorname{sign}\!\left( \sum_i \alpha_i y_i \, k(x_i, x) + b \right)$$
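The prediction rule can be sketched in a few lines (NumPy; `svm_predict` is an illustrative helper, and the dual variables $\alpha_i$ and offset $b$ are assumed to come from a solver rather than being computed here):

```python
import numpy as np

def svm_predict(x_new, alphas, ys, Xs, b, kernel):
    # Decision rule: sign( sum_i alpha_i * y_i * k(x_i, x_new) + b ).
    # Xs only needs to hold the support vectors (points with alpha_i > 0).
    scores = np.array([kernel(xi, x_new) for xi in Xs])
    return np.sign(alphas * ys @ scores + b)

# Toy usage: a linear kernel and two hand-set support vectors
lin = lambda a, c: a @ c
Xs = np.array([[1.0, 0.0], [-1.0, 0.0]])
ys = np.array([1.0, -1.0])
alphas = np.array([1.0, 1.0])

assert svm_predict(np.array([2.0, 0.0]), alphas, ys, Xs, 0.0, lin) == 1.0
assert svm_predict(np.array([-2.0, 0.0]), alphas, ys, Xs, 0.0, lin) == -1.0
```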

Complexity

  • Linear-kernel SVM: $O(nd)$ per pass over the data at training, $O(d)$ test time (store $w$ explicitly)
  • General-kernel SVM: $O(n^2 d)$ to build the Gram matrix at training, $O(n_{\text{sv}}\, d)$ test time (loop over the $n_{\text{sv}}$ support vectors)

So kernels aren’t free: there’s a dataset-size penalty. They’re worth it when $n$ is small or moderate and the true boundary is nonlinear.

From CS480 lec6.