Linear Regression
Linear regression learns a linear predictor by minimizing squared residuals. It is the foundational supervised learning method: convex, closed-form, and the MLE under Gaussian noise.
Setup
Given data $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$ (continuous labels, unlike classification).
Use the padding trick $x \mapsto (1, x)$, so the fitted hyperplane $w^\top x$ need not pass through the origin.
Empirical Risk Minimization (ERM)
The general learning goal is to minimize the true risk $\min_w \mathbb{E}_{(x,y)\sim P}[\ell(w; x, y)]$, but the data distribution $P$ is unknown. ERM minimizes the risk under the empirical distribution instead: $\min_w \frac{1}{n}\sum_{i=1}^n \ell(w; x_i, y_i)$.
By the law of large numbers, this converges to the true risk as $n \to \infty$.
Squared Loss
Stacking feature vectors into $X \in \mathbb{R}^{n \times d}$ and labels into $y \in \mathbb{R}^n$, the squared-loss objective is $\hat{w} = \arg\min_w \|Xw - y\|_2^2$.
Intuition
Squared loss punishes big misses far more than small ones: being off by 10 costs 100x as much as being off by 1, not 10x. This makes the fit chase outliers. L1 loss (absolute value) would be robust but non-differentiable; squared loss trades robustness for clean calculus.
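A quick numeric sketch of this asymmetry (hypothetical residual values, plain numpy):

```python
import numpy as np

residuals = np.array([1.0, 1.0, 10.0])  # two small misses, one big one
sq = residuals ** 2                     # squared loss per point
ab = np.abs(residuals)                  # L1 (absolute) loss per point

# The single 10-unit miss dominates the squared loss...
print(sq)        # [  1.   1. 100.]
print(sq.sum())  # 102.0 -> the outlier contributes ~98% of the total
# ...but only 10/12 of the L1 loss.
print(ab.sum())  # 12.0
```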
Convexity
$\nabla L(w) = 2X^\top(Xw - y)$. The Hessian is $\nabla^2 L(w) = 2X^\top X \succeq 0$ (since $v^\top X^\top X v = \|Xv\|_2^2 \ge 0$ for all $v$), so the loss is convex.
Normal Equations (Closed Form)
Setting $\nabla L(w) = 0$ gives the normal equations: $X^\top X w = X^\top y$.
If $X^\top X$ is invertible: $\hat{w} = (X^\top X)^{-1} X^\top y$. In practice, solve the linear system directly: forming the matrix inverse is slow and numerically imprecise for ill-conditioned $X^\top X$.
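A minimal sketch of solving the normal equations on synthetic data (the data and true weights here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
# Padding trick: first column of ones plays the role of the bias
X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Solve X^T X w = X^T y directly instead of forming the inverse
w = np.linalg.solve(X.T @ X, X.T @ y)

# Better conditioning still: least squares via QR/SVD, never forming X^T X
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w, w_lstsq))  # True
```

`np.linalg.lstsq` is preferred when $X^\top X$ may be ill-conditioned, since squaring $X$ squares its condition number.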
Geometric picture
$X\hat{w}$ is the projection of $y$ onto the column space of $X$. The residual $y - X\hat{w}$ is orthogonal to every column of $X$ (which is exactly what $X^\top(X\hat{w} - y) = 0$ says). You are finding the closest point to $y$ that is reachable by a linear combination of the features.
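The orthogonality claim is easy to check numerically on arbitrary data (random matrices here, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)

w = np.linalg.solve(X.T @ X, X.T @ y)
residual = y - X @ w

# Residual is orthogonal to every column of X: X^T (y - Xw) = 0
print(np.abs(X.T @ residual).max())  # ~1e-14, zero up to floating point
```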
Why squared loss?
Squared loss falls out of Gaussian-noise MLE.
Assume $y = w^\top x + \epsilon$ where $\epsilon \sim \mathcal{N}(0, \sigma^2)$. Then $p(y \mid x; w) = \mathcal{N}(y;\, w^\top x,\, \sigma^2)$.
The log-likelihood is $\log p(y \mid X; w) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - w^\top x_i)^2$. Dropping constants: $\arg\max_w \log p(y \mid X; w) = \arg\min_w \sum_{i=1}^n (y_i - w^\top x_i)^2$.
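The equivalence can be verified numerically: the least-squares solution minimizes the Gaussian NLL, constants and all (synthetic data, hypothetical parameter values):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
w_true = np.array([1.0, 3.0])
sigma = 0.5
y = X @ w_true + sigma * rng.normal(size=n)  # Gaussian noise model

def nll(w):
    # Negative log-likelihood under y ~ N(Xw, sigma^2 I), constants included
    r = y - X @ w
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + (r @ r) / (2 * sigma**2)

w_ls = np.linalg.solve(X.T @ X, X.T @ y)  # least-squares solution

# Any perturbation away from the least-squares solution increases the NLL
for delta in rng.normal(size=(5, 2)):
    assert nll(w_ls) <= nll(w_ls + 0.1 * delta)
print(w_ls)  # close to w_true
```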
Regularization
Pure least squares can overfit, especially when $d$ approaches or exceeds $n$, or when features are collinear.
Ridge regression (Tikhonov): penalize the $\ell_2$ norm of the weights, $\min_w \|Xw - y\|_2^2 + \lambda \|w\|_2^2$. Closed form: $\hat{w} = (X^\top X + \lambda I)^{-1} X^\top y$. Always invertible for $\lambda > 0$.
Ridge shrinks weights toward zero, spreading influence across correlated features instead of letting one spike. The $\lambda I$ adds a floor of $\lambda$ to the eigenvalues of $X^\top X$, which is why ill-conditioned problems become solvable.
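The eigenvalue floor is exact: adding $\lambda I$ shifts every eigenvalue of $X^\top X$ up by $\lambda$. A sketch with a deliberately near-collinear design matrix (fabricated data):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 30, 5
X = rng.normal(size=(n, d))
X[:, 4] = X[:, 3] + 1e-6 * rng.normal(size=n)  # two nearly collinear columns
y = rng.normal(size=n)

lam = 1.0
A = X.T @ X
eig_min = np.linalg.eigvalsh(A).min()                      # near zero: ill-conditioned
eig_min_ridge = np.linalg.eigvalsh(A + lam * np.eye(d)).min()

print(eig_min, eig_min_ridge)  # ridge lifts the smallest eigenvalue by lambda
w_ridge = np.linalg.solve(A + lam * np.eye(d), X.T @ y)
```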
Lasso: penalize the $\ell_1$ norm, $\min_w \|Xw - y\|_2^2 + \lambda \|w\|_1$; prefers sparse solutions.
The $\ell_1$ ball has corners on the axes. The squared-loss contours first touch it at a corner (with high probability), and a corner means some coordinates are exactly zero. That is why Lasso does feature selection while ridge does not.
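The corner intuition is visible in one dimension: the $\ell_1$-penalized solution is soft-thresholding, which maps small inputs to exactly zero, whereas the ridge solution only scales them down. A minimal sketch:

```python
import numpy as np

def soft_threshold(z, lam):
    """1D lasso solution: argmin_w (w - z)^2 / 2 + lam * |w|."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.array([-2.0, -0.3, 0.1, 0.8, 3.0])
# Exact zeros for |z| <= lam, unlike ridge's z / (1 + lam) which never hits zero
print(soft_threshold(z, 1.0))  # [-1.  0.  0.  0.  2.]
```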
See Regularization, L1 Regularization, L2 Regularization, Weight Decay.
Hyperparameter Selection
Pick $\lambda$ via a held-out validation set, or Cross Validation if no separate validation set is available. Often the regularizer is dropped at validation/test time: train with $\ell(w) + \lambda \|w\|_2^2$ but score on $\ell(w)$ alone.
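A sketch of the held-out procedure for ridge (split sizes and the $\lambda$ grid here are arbitrary choices, not from the source):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 80, 10
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.0, 0.5]  # sparse ground truth, rest is noise features
y = X @ w_true + rng.normal(size=n)

# Held-out split: train on 60 points, validate on 20
Xtr, Xva, ytr, yva = X[:60], X[60:], y[:60], y[60:]

def ridge(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Score each candidate lambda by UNregularized squared error on validation data
lams = [0.01, 0.1, 1.0, 10.0, 100.0]
val_err = [np.mean((yva - Xva @ ridge(Xtr, ytr, lam)) ** 2) for lam in lams]
best = lams[int(np.argmin(val_err))]
print(best, val_err)
```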
From CS480 lec2.
Related
- Logistic Regression (classification analogue via logit transform)
- Gradient Descent (alternative to the normal equations when $n$ or $d$ is huge)
- MLE
- Cross Validation
- CS480