Empirical Risk Minimization (ERM)
I think the idea is actually really simple: replace theoretical expectations with their empirical averages, and the Law of Large Numbers is what makes it work.
Goal (ideal): find parameters $\theta$ that minimize the expected risk
$$R(\theta) = \mathbb{E}_{(x,y)\sim P}\big[\ell(f_\theta(x), y)\big]$$
- Problem: the true distribution $P$ is unknown; we only have samples $\{(x_i, y_i)\}_{i=1}^{n}$ drawn i.i.d. from $P$.
What we do instead
- Use the empirical distribution defined by the dataset and minimize the empirical risk (sketched in code below):
$$\hat{R}_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell(f_\theta(x_i), y_i)$$
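To make this concrete, here is a minimal NumPy sketch of computing the empirical risk; the linear model, squared loss, and synthetic data are illustrative assumptions, not part of the definition:

```python
import numpy as np

def empirical_risk(theta, X, y, loss):
    """Average per-example loss of f_theta over the dataset (X, y)."""
    predictions = X @ theta  # linear model f_theta(x) = theta^T x (illustrative choice)
    return np.mean(loss(predictions, y))

# Squared loss as an example; any per-example loss plugs in here.
squared_loss = lambda pred, target: (pred - target) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
theta = np.array([1.0, -2.0, 0.5])          # hypothetical parameters
y = X @ theta + 0.1 * rng.normal(size=100)  # synthetic labels with noise

print(empirical_risk(theta, X, y, squared_loss))  # ~0.01, the noise variance
```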
Why this makes sense
- By the Law of Large Numbers, for any fixed $\theta$,
$$\hat{R}_n(\theta) \;\xrightarrow{\;n \to \infty\;}\; R(\theta),$$
meaning that as we collect more samples, the empirical risk approaches the true expected risk. (Strictly, this is pointwise convergence for each fixed $\theta$; for the empirical minimizer to track the true minimizer, we also want the convergence to be uniform over the hypothesis class.)
- Intuition: with enough data and a well-chosen hypothesis class, minimizing empirical risk approximates minimizing true risk; the simulation below illustrates the convergence.
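A quick sanity check of the LLN claim: hold $\theta$ fixed and watch $\hat{R}_n(\theta)$ approach $R(\theta)$ as $n$ grows. The data-generating model and noise level below are assumptions chosen so the true risk is known in closed form:

```python
import numpy as np

rng = np.random.default_rng(1)
theta = np.array([1.0, -2.0, 0.5])  # a fixed, hypothetical parameter vector
noise_std = 0.1

# With squared loss and y = theta^T x + eps, the true risk at theta
# is exactly the noise variance: E[eps^2] = noise_std**2.
true_risk = noise_std ** 2

for n in [10, 100, 1_000, 10_000, 100_000]:
    X = rng.normal(size=(n, 3))
    y = X @ theta + noise_std * rng.normal(size=n)
    emp_risk = np.mean((X @ theta - y) ** 2)
    print(f"n={n:>6}  empirical={emp_risk:.5f}  true={true_risk:.5f}")
```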
Notes
- Hypothesis class: restricting the search to a fixed class limits model complexity and helps generalization.
- Regularized ERM (Structural Risk Minimization):
$$\min_\theta \; \hat{R}_n(\theta) + \lambda\,\Omega(\theta)$$
where $\Omega(\theta)$ penalizes complexity (e.g., $\Omega(\theta) = \|\theta\|_2^2$); see the ridge sketch after this list.
- Common losses: MSE for regression, cross-entropy for classification, hinge loss for SVMs.
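As a concrete instance of regularized ERM: squared loss plus $\Omega(\theta) = \|\theta\|_2^2$ is ridge regression, which has a closed-form solution. A minimal sketch, with made-up data and $\lambda$ values:

```python
import numpy as np

def ridge_erm(X, y, lam):
    """Regularized ERM for linear regression with squared loss and
    Omega(theta) = ||theta||_2^2. Setting the gradient of
    (1/n)||X theta - y||^2 + lam * ||theta||^2 to zero gives the
    closed form theta = (X^T X + n*lam*I)^{-1} X^T y."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

print(ridge_erm(X, y, lam=0.0))  # plain ERM: ordinary least squares
print(ridge_erm(X, y, lam=1.0))  # penalty shrinks theta toward zero
```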
Example
- Linear regression: $f_\theta(x) = \theta^\top x$ with squared loss $\ell(\hat{y}, y) = (\hat{y} - y)^2$.
- ERM → minimize $\hat{R}_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} (\theta^\top x_i - y_i)^2$, i.e., ordinary least squares.
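A runnable sketch of this example, using NumPy's least-squares solver as the ERM step; the ground-truth parameters and noise are made up for illustration:

```python
import numpy as np

# Hypothetical ground-truth model, used only to generate data.
rng = np.random.default_rng(3)
theta_true = np.array([2.0, -1.0])
X = rng.normal(size=(200, 2))
y = X @ theta_true + 0.05 * rng.normal(size=200)

# ERM with squared loss is ordinary least squares;
# np.linalg.lstsq minimizes ||X theta - y||^2 directly.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated theta:", theta_hat)                        # close to theta_true
print("empirical risk :", np.mean((X @ theta_hat - y) ** 2))
```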