Supervised Learning

Linear Regression

Linear regression learns a linear predictor by minimizing squared residuals. It is the foundational supervised learning method: convex, closed-form, and the MLE under Gaussian noise.

Setup

Given a dataset $D = \{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$ (continuous labels, unlike classification).

Use the padding trick $\tilde{x} = (x^\top, 1)^\top$, absorbing the bias term into $w$, so the fitted line need not pass through the origin.
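A minimal sketch of the padding trick in NumPy (the dataset and weights are illustrative):

```python
import numpy as np

# Padding trick: append a constant-1 feature so the bias is absorbed into w.
# Hypothetical 3-point dataset for illustration.
X = np.array([[1.0], [2.0], [3.0]])               # n=3 samples, d=1 feature
X_pad = np.hstack([X, np.ones((X.shape[0], 1))])  # each row is now (x, 1)

# A weight vector w = (slope, intercept) acting on X_pad gives an affine map.
w = np.array([2.0, 0.5])                          # slope 2, intercept 0.5
y_hat = X_pad @ w                                 # = 2*x + 0.5
```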

Empirical Risk Minimization (ERM)

The general learning goal is to minimize the true risk

$$\min_f \; \mathbb{E}_{(x,y)\sim P}\big[\ell(f(x), y)\big],$$

but the data distribution $P$ is unknown. ERM minimizes over the empirical distribution instead:

$$\min_f \; \frac{1}{n}\sum_{i=1}^n \ell(f(x_i), y_i).$$

By the law of large numbers, this converges to the true risk as $n \to \infty$.

Squared Loss

Stacking the (padded) feature vectors into the rows of $X \in \mathbb{R}^{n \times (d+1)}$ and the labels into $y \in \mathbb{R}^n$, least squares minimizes:

$$\hat{w} = \arg\min_w \; \|Xw - y\|_2^2.$$

Intuition

Squared loss punishes big misses far more than small ones: being off by 10 costs 100x as much as being off by 1, not 10x. This makes the fit chase outliers. L1 loss (absolute value) would be robust but non-differentiable; squared loss trades robustness for clean calculus.
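The 100x-vs-10x claim can be checked directly (the residual values are illustrative):

```python
import numpy as np

# Squared vs absolute loss on two residuals: off by 1 vs off by 10.
residuals = np.array([1.0, 10.0])
sq = residuals ** 2        # squared loss: penalty ratio is 100x
ab = np.abs(residuals)     # L1 loss: penalty ratio is only 10x
```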

Convexity

The gradient is $\nabla_w \|Xw - y\|^2 = 2X^\top(Xw - y)$. The Hessian is $2X^\top X \succeq 0$ (since $v^\top X^\top X v = \|Xv\|^2 \ge 0$ for all $v$), so the loss is convex.

Normal Equations (Closed Form)

Setting the gradient to zero, $2X^\top(Xw - y) = 0$, gives the normal equations:

$$X^\top X \, w = X^\top y.$$

If $X^\top X$ is invertible: $\hat{w} = (X^\top X)^{-1} X^\top y$. In practice, solve the linear system directly (e.g. via Cholesky or QR): forming the matrix inverse is slow and numerically imprecise for ill-conditioned $X^\top X$.
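A sketch of both approaches on synthetic data (the dataset and true weights are made up for illustration); `np.linalg.lstsq` solves the same least-squares problem via a more numerically stable factorization, so the two should agree:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=n)

# Solve the normal equations X^T X w = X^T y as a linear system,
# without ever forming an explicit inverse.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# lstsq solves min ||Xw - y||^2 directly via an orthogonal factorization.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```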

Geometric picture

$X\hat{w}$ is the projection of $y$ onto the column space of $X$. The residual $y - X\hat{w}$ is orthogonal to every column of $X$ (which is exactly what $X^\top(y - X\hat{w}) = 0$ says). You are finding the closest point to $y$ that is reachable by a linear combination of the features.
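The orthogonality of the residual can be verified numerically (random data for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))
y = rng.normal(size=20)

# Fit by the normal equations, then check X^T (y - X w_hat) = 0.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
residual = y - X @ w_hat
ortho = X.T @ residual   # should be (numerically) the zero vector
```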

Why squared loss?

Squared loss falls out of Gaussian-noise MLE.

Assume $y = w^\top x + \epsilon$ where $\epsilon \sim \mathcal{N}(0, \sigma^2)$. Then

$$p(y \mid x; w) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y - w^\top x)^2}{2\sigma^2}\right).$$

Dropping constants, maximizing the log-likelihood over the dataset is equivalent to:

$$\min_w \; \sum_{i=1}^n (y_i - w^\top x_i)^2.$$

Regularization

Pure least squares can overfit, especially with $d > n$ or collinear features.

Ridge regression (Tikhonov): penalize the $\ell_2$ norm of the weights, $\min_w \|Xw - y\|^2 + \lambda \|w\|_2^2$. Closed form: $\hat{w} = (X^\top X + \lambda I)^{-1} X^\top y$. Always invertible for $\lambda > 0$.

Ridge shrinks weights toward zero, spreading influence across correlated features instead of letting one spike. The $\lambda I$ term adds a floor of $\lambda$ to the eigenvalues of $X^\top X$, which is why ill-conditioned problems become solvable.
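A sketch of the ridge closed form in a $d > n$ setting, where plain least squares would fail (the data and $\lambda$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 10, 20            # d > n: X^T X is singular without regularization
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
lam = 0.1

# Ridge closed form (X^T X + lam I)^{-1} X^T y, solved as a linear system.
A = X.T @ X + lam * np.eye(d)
w_ridge = np.linalg.solve(A, X.T @ y)

# lam*I lifts every eigenvalue of X^T X by lam, so A is always invertible.
eigs = np.linalg.eigvalsh(A)
```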

Lasso: penalize the $\ell_1$ norm, $\min_w \|Xw - y\|^2 + \lambda \|w\|_1$; prefers sparse solutions.

The $\ell_1$ ball has corners on the axes. The squared-loss contours first touch it at a corner (with high probability), and a corner means some coordinates are exactly zero. That is why Lasso does feature selection while ridge does not.
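The lasso has no closed form, but one simple solver is iterative soft thresholding (ISTA); this is a sketch under assumed settings, not the only way to fit the lasso, and the dataset is synthetic:

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1: shrink toward zero, clip at zero.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    # ISTA for min_w 0.5 * ||Xw - y||^2 + lam * ||w||_1:
    # a gradient step on the smooth part, then a soft-threshold step.
    L = np.linalg.eigvalsh(X.T @ X).max()   # Lipschitz constant of gradient
    step = 1.0 / L
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)
        w = soft_threshold(w - step * grad, step * lam)
    return w

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))
w_true = np.zeros(10)
w_true[:2] = [3.0, -2.0]                    # only 2 active features
y = X @ w_true + 0.1 * rng.normal(size=100)
w_lasso = lasso_ista(X, y, lam=5.0)
```

Note that soft thresholding produces exact zeros, unlike ridge's smooth shrinkage.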

See Regularization, L1 Regularization, L2 Regularization, Weight Decay.

Hyperparameter Selection

Pick $\lambda$ via a held-out validation set, or Cross Validation if no separate validation set is available. Note that regularization is turned off at validation/test time: train with $\|Xw - y\|^2 + \lambda \|w\|_2^2$ but score on $\|Xw - y\|^2$ alone.
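A sketch of held-out selection of $\lambda$ for ridge (the split, candidate grid, and data are all illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 80, 15
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.5 * rng.normal(size=n)

# Hypothetical 60/20 train/validation split.
X_tr, y_tr = X[:60], y[:60]
X_va, y_va = X[60:], y[60:]

def ridge_fit(X, y, lam):
    # Ridge closed form, solved as a linear system.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Train with the penalized objective, but score candidates on plain
# squared error -- regularization is off at validation time.
lams = [0.01, 0.1, 1.0, 10.0, 100.0]
val_err = [np.mean((X_va @ ridge_fit(X_tr, y_tr, lam) - y_va) ** 2)
           for lam in lams]
best_lam = lams[int(np.argmin(val_err))]
```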

From CS480 lec2.