Linear Regression
Linear regression learns a linear predictor by minimizing squared residuals. It is the foundational supervised learning method: convex, closed-form, and the MLE under Gaussian noise.
Setup
Given data $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$ (continuous labels, unlike classification).
Use the padding trick $x \mapsto (1, x)$, so the fitted hyperplane $w^\top x$ need not pass through the origin.
Empirical Risk Minimization (ERM)
The general learning goal is to minimize the true risk $\min_w \mathbb{E}_{(x,y)\sim P}[\ell(w; x, y)]$, but the data distribution $P$ is unknown. ERM minimizes the risk under the empirical distribution instead: $\min_w \frac{1}{n}\sum_{i=1}^n \ell(w; x_i, y_i)$.
By the law of large numbers, this converges to the true risk as $n \to \infty$.
Squared Loss
Stacking feature vectors into $X \in \mathbb{R}^{n \times d}$ and labels into $y \in \mathbb{R}^n$, the squared-loss objective is $\hat{w} = \arg\min_w \|Xw - y\|_2^2$.
Intuition
Squared loss punishes big misses far more than small ones: being off by 10 costs 100x as much as being off by 1, not 10x. This makes the fit chase outliers. L1 loss (absolute value) would be robust but non-differentiable; squared loss trades robustness for clean calculus.
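A quick numeric sketch of this asymmetry (hypothetical residual values, plain numpy):

```python
import numpy as np

residuals = np.array([1.0, 1.0, 10.0])  # two small misses, one big one
sq = residuals ** 2                     # squared loss per point
ab = np.abs(residuals)                  # L1 (absolute) loss per point

# The single 10-unit miss dominates the squared loss...
print(sq)        # [  1.   1. 100.]
print(sq.sum())  # 102.0 -> the outlier contributes ~98% of the total
# ...but only 10/12 of the L1 loss.
print(ab.sum())  # 12.0
```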
Convexity
$\nabla L(w) = 2X^\top(Xw - y)$. The Hessian is $\nabla^2 L(w) = 2X^\top X \succeq 0$ (since $v^\top X^\top X v = \|Xv\|_2^2 \ge 0$ for all $v$), so the loss is convex.
Normal Equations (Closed Form)
Setting $\nabla L(w) = 0$ gives the normal equations: $X^\top X w = X^\top y$.
If $X^\top X$ is invertible: $\hat{w} = (X^\top X)^{-1} X^\top y$. In practice, solve the linear system directly: forming the matrix inverse is slow and numerically imprecise for ill-conditioned $X^\top X$.
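A minimal sketch of solving the normal equations on synthetic data (the data and true weights here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
# Padding trick: first column of ones plays the role of the bias
X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Solve X^T X w = X^T y directly instead of forming the inverse
w = np.linalg.solve(X.T @ X, X.T @ y)

# Better conditioning still: least squares via QR/SVD, never forming X^T X
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w, w_lstsq))  # True
```

`np.linalg.lstsq` is preferred when $X^\top X$ may be ill-conditioned, since squaring $X$ squares its condition number.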
Geometric picture
$X\hat{w}$ is the projection of $y$ onto the column space of $X$. The residual $y - X\hat{w}$ is orthogonal to every column of $X$ (which is exactly what $X^\top(X\hat{w} - y) = 0$ says). You are finding the closest point to $y$ that is reachable by a linear combination of the features.
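The orthogonality claim is easy to check numerically on arbitrary data (random matrices here, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)

w = np.linalg.solve(X.T @ X, X.T @ y)
residual = y - X @ w

# Residual is orthogonal to every column of X: X^T (y - Xw) = 0
print(np.abs(X.T @ residual).max())  # ~1e-14, zero up to floating point
```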
Why squared loss?
Squared loss falls out of Gaussian-noise MLE.
Assume $y = w^\top x + \epsilon$ where $\epsilon \sim \mathcal{N}(0, \sigma^2)$. Then $p(y \mid x; w) = \mathcal{N}(y;\, w^\top x,\, \sigma^2)$.
The log-likelihood is $\log p(y \mid X; w) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - w^\top x_i)^2$. Dropping constants: $\arg\max_w \log p(y \mid X; w) = \arg\min_w \sum_{i=1}^n (y_i - w^\top x_i)^2$.
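The equivalence can be verified numerically: the least-squares solution minimizes the Gaussian NLL, constants and all (synthetic data, hypothetical parameter values):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
w_true = np.array([1.0, 3.0])
sigma = 0.5
y = X @ w_true + sigma * rng.normal(size=n)  # Gaussian noise model

def nll(w):
    # Negative log-likelihood under y ~ N(Xw, sigma^2 I), constants included
    r = y - X @ w
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + (r @ r) / (2 * sigma**2)

w_ls = np.linalg.solve(X.T @ X, X.T @ y)  # least-squares solution

# Any perturbation away from the least-squares solution increases the NLL
for delta in rng.normal(size=(5, 2)):
    assert nll(w_ls) <= nll(w_ls + 0.1 * delta)
print(w_ls)  # close to w_true
```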
Regularization
Pure least squares can overfit, especially when $d$ approaches or exceeds $n$, or when features are collinear.
Ridge regression (Tikhonov): penalize the $\ell_2$ norm of the weights, $\min_w \|Xw - y\|_2^2 + \lambda \|w\|_2^2$. Closed form: $\hat{w} = (X^\top X + \lambda I)^{-1} X^\top y$. Always invertible for $\lambda > 0$.
Ridge shrinks weights toward zero, spreading influence across correlated features instead of letting one spike. The $\lambda I$ adds a floor of $\lambda$ to the eigenvalues of $X^\top X$, which is why ill-conditioned problems become solvable.
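The eigenvalue floor is exact: adding $\lambda I$ shifts every eigenvalue of $X^\top X$ up by $\lambda$. A sketch with a deliberately near-collinear design matrix (fabricated data):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 30, 5
X = rng.normal(size=(n, d))
X[:, 4] = X[:, 3] + 1e-6 * rng.normal(size=n)  # two nearly collinear columns
y = rng.normal(size=n)

lam = 1.0
A = X.T @ X
eig_min = np.linalg.eigvalsh(A).min()                      # near zero: ill-conditioned
eig_min_ridge = np.linalg.eigvalsh(A + lam * np.eye(d)).min()

print(eig_min, eig_min_ridge)  # ridge lifts the smallest eigenvalue by lambda
w_ridge = np.linalg.solve(A + lam * np.eye(d), X.T @ y)
```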
Lasso: penalize the $\ell_1$ norm, $\min_w \|Xw - y\|_2^2 + \lambda \|w\|_1$; prefers sparse solutions.
The $\ell_1$ ball has corners on the axes. The squared-loss contours first touch it at a corner (with high probability), and a corner means some coordinates are exactly zero. That is why Lasso does feature selection while ridge does not.
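The corner intuition is visible in one dimension: the $\ell_1$-penalized solution is soft-thresholding, which maps small inputs to exactly zero, whereas the ridge solution only scales them down. A minimal sketch:

```python
import numpy as np

def soft_threshold(z, lam):
    """1D lasso solution: argmin_w (w - z)^2 / 2 + lam * |w|."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.array([-2.0, -0.3, 0.1, 0.8, 3.0])
# Exact zeros for |z| <= lam, unlike ridge's z / (1 + lam) which never hits zero
print(soft_threshold(z, 1.0))  # [-1.  0.  0.  0.  2.]
```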
See Regularization, L1 Regularization, L2 Regularization, Weight Decay.
Hyperparameter Selection
Pick $\lambda$ via a held-out validation set, or Cross Validation if no separate validation set is available. Often the regularizer is dropped at validation/test time: train with $\ell(w) + \lambda \|w\|_2^2$ but score on $\ell(w)$ alone.
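A sketch of the held-out procedure for ridge (split sizes and the $\lambda$ grid here are arbitrary choices, not from the source):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 80, 10
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.0, 0.5]  # sparse ground truth, rest is noise features
y = X @ w_true + rng.normal(size=n)

# Held-out split: train on 60 points, validate on 20
Xtr, Xva, ytr, yva = X[:60], X[60:], y[:60], y[60:]

def ridge(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Score each candidate lambda by UNregularized squared error on validation data
lams = [0.01, 0.1, 1.0, 10.0, 100.0]
val_err = [np.mean((yva - Xva @ ridge(Xtr, ytr, lam)) ** 2) for lam in lams]
best = lams[int(np.argmin(val_err))]
print(best, val_err)
```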
From CS480 lec2.
Related
- Logistic Regression (classification analogue via logit transform)
- Gradient Descent (alternative to the normal equations when $n$ or $d$ is huge)
- MLE
- Cross Validation
- CS480