Empirical Risk Minimization (ERM)
I think the idea is actually really simple: replace theoretical expectations with their empirical averages, and the Law of Large Numbers is what makes it work.
Goal (ideal): find parameters $\theta$ that minimize the expected risk
$$R(\theta) = \mathbb{E}_{(x,y)\sim P}\big[\ell(f_\theta(x), y)\big]$$
- Problem: the true distribution $P$ is unknown; we only have samples $\{(x_i, y_i)\}_{i=1}^{n}$ drawn i.i.d. from $P$.
What we do instead
- Use the empirical distribution defined by the dataset and minimize the empirical risk (sketched in code below):
$$\hat{R}_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell(f_\theta(x_i), y_i)$$
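To make this concrete, here is a minimal NumPy sketch of computing the empirical risk; the linear model, squared loss, and synthetic data are illustrative assumptions, not part of the definition:

```python
import numpy as np

def empirical_risk(theta, X, y, loss):
    """Average per-example loss of f_theta over the dataset (X, y)."""
    predictions = X @ theta  # linear model f_theta(x) = theta^T x (illustrative choice)
    return np.mean(loss(predictions, y))

# Squared loss as an example; any per-example loss plugs in here.
squared_loss = lambda pred, target: (pred - target) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
theta = np.array([1.0, -2.0, 0.5])          # hypothetical parameters
y = X @ theta + 0.1 * rng.normal(size=100)  # synthetic labels with noise

print(empirical_risk(theta, X, y, squared_loss))  # ~0.01, the noise variance
```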
Why this makes sense
- By the Law of Large Numbers, for any fixed $\theta$,
$$\hat{R}_n(\theta) \;\xrightarrow{\;n \to \infty\;}\; R(\theta),$$
meaning that as we collect more samples, the empirical risk approaches the true expected risk. (Strictly, this is pointwise convergence for each fixed $\theta$; for the empirical minimizer to track the true minimizer, we also want the convergence to be uniform over the hypothesis class.)
- Intuition: with enough data and a well-chosen hypothesis class, minimizing empirical risk approximates minimizing true risk; the simulation below illustrates the convergence.
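A quick sanity check of the LLN claim: hold $\theta$ fixed and watch $\hat{R}_n(\theta)$ approach $R(\theta)$ as $n$ grows. The data-generating model and noise level below are assumptions chosen so the true risk is known in closed form:

```python
import numpy as np

rng = np.random.default_rng(1)
theta = np.array([1.0, -2.0, 0.5])  # a fixed, hypothetical parameter vector
noise_std = 0.1

# With squared loss and y = theta^T x + eps, the true risk at theta
# is exactly the noise variance: E[eps^2] = noise_std**2.
true_risk = noise_std ** 2

for n in [10, 100, 1_000, 10_000, 100_000]:
    X = rng.normal(size=(n, 3))
    y = X @ theta + noise_std * rng.normal(size=n)
    emp_risk = np.mean((X @ theta - y) ** 2)
    print(f"n={n:>6}  empirical={emp_risk:.5f}  true={true_risk:.5f}")
```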
Notes
- Hypothesis class: restricting the search to a fixed class limits model complexity and helps generalization.
- Regularized ERM (Structural Risk Minimization):
$$\min_\theta \; \hat{R}_n(\theta) + \lambda\,\Omega(\theta)$$
where $\Omega(\theta)$ penalizes complexity (e.g., $\Omega(\theta) = \|\theta\|_2^2$); see the ridge sketch after this list.
- Common losses: MSE for regression, cross-entropy for classification, hinge loss for SVMs.
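As a concrete instance of regularized ERM: squared loss plus $\Omega(\theta) = \|\theta\|_2^2$ is ridge regression, which has a closed-form solution. A minimal sketch, with made-up data and $\lambda$ values:

```python
import numpy as np

def ridge_erm(X, y, lam):
    """Regularized ERM for linear regression with squared loss and
    Omega(theta) = ||theta||_2^2. Setting the gradient of
    (1/n)||X theta - y||^2 + lam * ||theta||^2 to zero gives the
    closed form theta = (X^T X + n*lam*I)^{-1} X^T y."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

print(ridge_erm(X, y, lam=0.0))  # plain ERM: ordinary least squares
print(ridge_erm(X, y, lam=1.0))  # penalty shrinks theta toward zero
```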
Example
- Linear regression: $f_\theta(x) = \theta^\top x$ with squared loss $\ell(\hat{y}, y) = (\hat{y} - y)^2$.
- ERM → minimize $\hat{R}_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} (\theta^\top x_i - y_i)^2$, i.e., ordinary least squares.
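A runnable sketch of this example, using NumPy's least-squares solver as the ERM step; the ground-truth parameters and noise are made up for illustration:

```python
import numpy as np

# Hypothetical ground-truth model, used only to generate data.
rng = np.random.default_rng(3)
theta_true = np.array([2.0, -1.0])
X = rng.normal(size=(200, 2))
y = X @ theta_true + 0.05 * rng.normal(size=200)

# ERM with squared loss is ordinary least squares;
# np.linalg.lstsq minimizes ||X theta - y||^2 directly.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated theta:", theta_hat)                        # close to theta_true
print("empirical risk :", np.mean((X @ theta_hat - y) ** 2))
```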