Classification

K-Nearest Neighbors (kNN)

kNN classifies a test point by finding its $k$ closest points in the training set and letting them vote on the label. There is no "training" stage: kNN is a non-parametric lazy learner that just memorizes the dataset.

With kNN, when we want to produce an output for a new test input $x$, we find the $k$ nearest neighbours to $x$ in the training data $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$, and the neighbours "vote" on the label of $x$. Majority wins for classification; for regression, average the labels.
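A minimal sketch of the predict step in NumPy (the function name knn_predict, the toy data, and the choice of L2 distance are illustrative assumptions, not from the notes):

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify one query point by majority vote of its k nearest
    training points under Euclidean (L2) distance."""
    # Distances from the query to every training point
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Majority vote over the neighbours' labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy usage: two clusters in 2-D
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.2, 0.1]), k=3))  # -> 0
```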

See also https://cs231n.github.io/classification/.

Bayes Optimal Classifier

The optimal classification rule (minimum error) is the Bayes classifier:

$$h^*(x) = \arg\max_y \, P(y \mid x)$$

This requires knowing the true data distribution $P(x, y)$. kNN is an approximation to the Bayes classifier that only uses the training data.

The kNN Assumption

If feature vectors $x_i$ and $x_j$ are close, then their labels $y_i$ and $y_j$ are likely the same.

Distance and aggregation are underspecified: typically Euclidean (L2) distance and majority vote.

Intuition

kNN is non-parametric because the decision boundary is implicit in the training points themselves, not summarized by a fixed-size weight vector. More data means a more flexible boundary. $k$ controls smoothness: $k = 1$ gives a jagged boundary that memorizes noise, large $k$ smooths everything toward the majority class.

Time and Space Complexity

  • Training: $O(1)$ time, $O(nd)$ space (just store the dataset). Compare with perceptron or linear regression at $O(d)$ space
  • Test-time classification: naively $O(nd)$ per query
  • Better options:
    • Voronoi diagrams: $O(\log n)$ query time but $O(n^{\lceil d/2 \rceil})$ space (good in low dimensions)
    • Approximate nearest neighbours (e.g., FLANN, FAISS, HNSW): trade a little accuracy for large speedups
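For instance, scikit-learn's KNeighborsClassifier can switch between the naive brute-force search and a tree index (a sketch; the random data and sizes are placeholders):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))   # low-dimensional data, where trees help
y = (X[:, 0] > 0).astype(int)

# Naive O(nd)-per-query search vs. a KD-tree index
brute = KNeighborsClassifier(n_neighbors=5, algorithm="brute").fit(X, y)
tree = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree").fit(X, y)

Xq = rng.normal(size=(100, 3))
# Same predictions (barring exact distance ties), different search cost
assert (brute.predict(Xq) == tree.predict(Xq)).all()
```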

Theoretical Guarantee (Cover-Hart '67)

As $n \to \infty$: $\epsilon_{1\mathrm{NN}} \le 2\,\epsilon_{\mathrm{Bayes}}$.

  • If Bayes error is 0, 1NN has 0 error (with infinite data)
  • If Bayes error is 0.1, 1NN has error $\le 0.2$

Remarkable: 1NN, despite its simplicity, is within a factor of 2 of the best possible classifier given infinite data. The catch (below) is that "infinite" can mean exponential in $d$.

Caveat

$n$ may have to be exponentially large in $d$. This is the Curse of Dimensionality.
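A standard back-of-the-envelope illustration: guaranteeing that every query in the unit cube $[0,1]^d$ has a training neighbour within distance $\varepsilon$ takes on the order of $(1/\varepsilon)^d$ points under a grid covering. With $\varepsilon = 0.1$ that is $10^d$: ten points in one dimension, $10^{100}$ in a hundred.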

Practical Recipe

  1. Preprocess your data: normalize features to zero mean and unit variance
  2. For high-dimensional data, reduce dimensionality first with PCA, NCA, or Random Projection
  3. Sweep $k$ (odd values to avoid ties) and distance types (L1, L2, L∞) on validation or via Cross Validation
  4. Lock in hyperparameters, then report test accuracy once
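A sketch of this recipe in scikit-learn (the dataset, PCA width, and parameter grid are placeholder choices of mine):

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data: 500 points, 50 features
rng = np.random.default_rng(0)
X, y = rng.normal(size=(500, 50)), rng.integers(0, 2, 500)

X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),      # step 1: zero mean, unit variance
    ("pca", PCA(n_components=20)),    # step 2: reduce dimensionality
    ("knn", KNeighborsClassifier()),
])
# Step 3: sweep odd k and the Minkowski order p (p=1 is L1, p=2 is L2)
grid = GridSearchCV(pipe, {"knn__n_neighbors": [1, 3, 5, 7, 9],
                           "knn__p": [1, 2]}, cv=5)
grid.fit(X_dev, y_dev)
# Step 4: lock in hyperparameters, then report test accuracy once
print(grid.best_params_, grid.score(X_test, y_test))
```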

Distance Metrics on Images

From CS231n Lecture 2, kNN is introduced as the first classifier, not because it's good for images (it isn't), but because it makes the train/predict API concrete.

L1 (Manhattan) distance, coordinate-dependent (changes if you rotate the axes):

$$d_1(I_1, I_2) = \sum_p \left| I_1^p - I_2^p \right|$$

L2 (Euclidean) distance, rotation-invariant:

$$d_2(I_1, I_2) = \sqrt{\sum_p \left( I_1^p - I_2^p \right)^2}$$

The choice of L1 vs L2 reshapes decision boundaries: L1 boundaries align with the coordinate axes, L2 boundaries don't. Pick L1 when individual feature dimensions have semantic meaning (then axis-aligned splits are natural).
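Both distances are one-liners on flattened images (a sketch; the image shapes are arbitrary):

```python
import numpy as np

I1 = np.random.rand(32, 32, 3).ravel()  # flattened images as vectors
I2 = np.random.rand(32, 32, 3).ravel()

d_l1 = np.sum(np.abs(I1 - I2))           # L1: sum of absolute pixel differences
d_l2 = np.sqrt(np.sum((I1 - I2) ** 2))   # L2: Euclidean distance
```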

Hyperparameters: $k$ and the distance metric

$k$ and the distance metric are hyperparameters: choices about the algorithm itself, not learned from data. Bad ways to set them:

  1. Pick what works best on the training set: this always picks $k = 1$ (it memorizes the training data)
  2. Pick what works best on the test set: this leaks the test set, so you no longer have an unbiased estimate of generalization

Right way: split into train / validation / test. Tune on validation, evaluate once on test. For small datasets, use $k$-fold cross-validation.
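A minimal sketch of that three-way protocol (assuming scikit-learn; the data, split fractions, and candidate $k$ values are placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 10)), rng.integers(0, 3, 1000)  # placeholder data

# 60% train / 20% validation / 20% test
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Tune k on the validation set only
best_k = max([1, 3, 5, 7, 9], key=lambda k: KNeighborsClassifier(n_neighbors=k)
             .fit(X_tr, y_tr).score(X_val, y_val))
# Evaluate exactly once on the test set with the locked-in k
print(best_k, KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr).score(X_te, y_te))
```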

Why kNN is never used for images in practice

  1. Test-time is slow: $O(nd)$ per query, exactly the wrong end of the tradeoff for deployment
  2. Pixel distances are not perceptually meaningful: the slides show three altered CIFAR images (one shifted, one tinted, one boxed) all at the same L2 distance from the original, yet a human would never confuse them with each other
  3. Curse of dimensionality: to densely cover a $d$-dimensional unit cube you need a number of points exponential in $d$, and CIFAR images live in $d = 32 \times 32 \times 3 = 3072$ dimensions