Regularization

Regularization is any technique that discourages model complexity in order to prevent overfitting; it is an instance of Occam's Razor and is used widely in machine learning.

The general form adds a penalty to the loss:

$$L = \frac{1}{N} \sum_i L_i\big(f(x_i, W), y_i\big) + \lambda R(W)$$

where:

  • $\lambda$ is the regularization strength (a hyperparameter)
  • $R(W)$ penalizes model complexity (e.g. a weight norm)
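A minimal sketch of this form, using mean squared error as the data term and the squared L2 norm as $R(W)$ (names and loss choice are illustrative, not fixed by the formula):

```python
import numpy as np

def regularized_loss(W, X, y, lam):
    data_loss = np.mean((X @ W - y) ** 2)  # fit the training set
    penalty = np.sum(W ** 2)               # R(W): discourage large weights
    return data_loss + lam * penalty
```

With `lam = 0` this reduces to the plain data loss; increasing `lam` trades training fit for smaller weights.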

Intuition

The data term wants the model to fit the training set. The penalty term wants the model to stay simple. $\lambda$ is the exchange rate between the two. A flexible model usually has many weight settings that fit the training data equally well; the penalty picks the simplest one from that bunch, which is the one most likely to generalize. Equivalently (Bayesian view), $R(W)$ is a negative log prior over weights, and minimizing loss plus penalty is MAP estimation.
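The MAP correspondence spelled out (a standard derivation, not from the note): a Gaussian prior on the weights turns into an L2 penalty after taking logs.

```latex
\hat{W}_{\text{MAP}}
  = \arg\max_W \; \log p(\mathcal{D} \mid W) + \log p(W),
\qquad
p(W) \propto e^{-\lambda \lVert W \rVert_2^2}
\;\Rightarrow\;
\hat{W}_{\text{MAP}}
  = \arg\min_W \; \underbrace{-\log p(\mathcal{D} \mid W)}_{\text{data loss}}
    + \lambda \lVert W \rVert_2^2
```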

Parameter penalties (reduce effective model capacity):

  • L2 Regularization: $R(W) = \sum_k \sum_l W_{k,l}^2$
  • L1 Regularization (Lasso): $R(W) = \sum_k \sum_l |W_{k,l}|$
  • Elastic Net (L1 + L2): $R(W) = \sum_k \sum_l \beta W_{k,l}^2 + |W_{k,l}|$
  • Max-norm regularization
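The three penalty formulas above, as code (a sketch; `beta` is the Elastic Net mixing weight from the formula):

```python
import numpy as np

def l2_penalty(W):
    return np.sum(W ** 2)                      # sum of squared weights

def l1_penalty(W):
    return np.sum(np.abs(W))                   # sum of absolute weights

def elastic_net_penalty(W, beta):
    return np.sum(beta * W ** 2 + np.abs(W))   # beta trades off L2 vs L1
```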

Training-time regularization:

  • Dropout
  • Data augmentation
  • Early stopping

Model/structure choices (capacity control):

  • Simpler model / fewer parameters
  • Feature selection / dimensionality reduction (PCA)
  • Ensembling (bagging, random forest): reduces variance

Linear models:

  • SVM margin (hinge loss + $\lambda \lVert w \rVert_2^2$) acts like regularization via the margin/penalty tradeoff
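A sketch of that soft-margin objective: average hinge loss plus an L2 penalty, where `lam` trades margin violations against the weight norm (function name and shapes are illustrative):

```python
import numpy as np

def svm_objective(w, X, y, lam):
    margins = y * (X @ w)                            # labels y in {-1, +1}
    hinge = np.mean(np.maximum(0.0, 1.0 - margins))  # penalize margins < 1
    return hinge + lam * np.dot(w, w)                # + lambda * ||w||^2
```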

Normalization (not strictly regularization but stabilizes training / helps prevent overfitting):

  • Batch normalization
  • Layer normalization

L1 vs L2 preference

From the CS231n Lecture 3 slides. Given $x = [1, 1, 1, 1]$, two weight vectors with the same dot product $w^\top x = 1$:

  • $w_1 = [1, 0, 0, 0]$: L1 picks this ("sparse")
  • $w_2 = [0.25, 0.25, 0.25, 0.25]$: L2 picks this ("spread out")

Both produce identical predictions on this $x$; the regularizer is what breaks the tie. Pick the penalty that matches your prior over what a "good" classifier looks like.
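The tie-break can be checked numerically. One caveat: on this exact pair the L1 norms happen to tie (both equal 1), so L1's preference for sparsity comes from its geometry during optimization rather than from this comparison; L2, by contrast, strictly prefers the spread-out vector:

```python
import numpy as np

x  = np.array([1.0, 1.0, 1.0, 1.0])
w1 = np.array([1.0, 0.0, 0.0, 0.0])        # sparse
w2 = np.array([0.25, 0.25, 0.25, 0.25])    # spread out

assert w1 @ x == w2 @ x == 1.0                    # identical predictions on x
assert np.sum(w2 ** 2) < np.sum(w1 ** 2)          # L2 penalty prefers w2
assert np.sum(np.abs(w1)) == np.sum(np.abs(w2))   # L1 norms tie here
```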

Why regularize?

  1. Express preferences over weights (which solution among the equivalent ones?)
  2. Make the model simple so it generalizes to test data
  3. Improve optimization by adding curvature (the loss landscape becomes more bowl-shaped)
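Point 3 shows up concretely in closed-form ridge regression (a sketch, not from the note): the $\lambda I$ term adds curvature in every direction, so $X^\top X + \lambda I$ is invertible even when $X^\top X$ is singular.

```python
import numpy as np

def ridge_solution(X, y, lam):
    d = X.shape[1]
    # lam * I makes the normal equations well-conditioned, even when
    # X has duplicate (collinear) columns and X^T X is singular.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```

With `lam = 0` and collinear columns, `np.linalg.solve` would fail on the singular matrix; any `lam > 0` restores a unique solution.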

From MATH213

Definition 2: Regularization

A function $\tilde{f}$ is the regularization of a function $f$ if

  1. For all $t$ in the domain of $f$, $\tilde{f}(t) = f(t)$
  2. For all $t$ such that $\lim_{s \to t} f(s)$ exists but $f(t)$ is undefined, $\tilde{f}(t) = \lim_{s \to t} f(s)$
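For example (my illustration, not from the course notes): $f(t) = \sin(t)/t$ is undefined at $t = 0$, but the limit there exists and equals 1, so the regularization $\tilde{f}$ fills the hole:

```latex
\tilde{f}(t) =
\begin{cases}
\dfrac{\sin t}{t} & t \neq 0 \\[6pt]
1 & t = 0
\end{cases}
```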

Theorem 1

If $\tilde{f}$ is the regularization of a function $f$ that has a finite number of discontinuities, then $\mathcal{L}\{\tilde{f}\} = \mathcal{L}\{f\}$.

Also see Finite Zeros and Finite Poles.