Empirical Risk Minimization (ERM)

I think the idea is actually really simple: replace theoretical expectations with empirical averages over the observed data, and the Law of Large Numbers is what makes this work.

Goal (ideal): find parameters θ that minimize the expected risk

    R(θ) = E_{(x,y)∼P}[ℓ(f_θ(x), y)]

  • Problem: the true distribution P is unknown; we only have an i.i.d. sample {(x_i, y_i)}_{i=1}^n.

What we do instead

  • Use the empirical distribution defined by the dataset and minimize the empirical risk:

        R̂_n(θ) = (1/n) Σ_{i=1}^n ℓ(f_θ(x_i), y_i)
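
A minimal sketch of computing the empirical risk, assuming (for illustration only) a linear model f_θ(x) = θᵀx and squared loss:

```python
import numpy as np

def empirical_risk(theta, X, y):
    # average of the per-example squared losses: (1/n) * sum_i (theta^T x_i - y_i)^2
    return np.mean((X @ theta - y) ** 2)

# toy dataset drawn from a known linear model (no noise)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true

print(empirical_risk(theta_true, X, y))  # zero at the data-generating parameters
```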

Why this makes sense

By the Law of Large Numbers, for each fixed θ,

    R̂_n(θ) → R(θ)  as n → ∞,

meaning that as we collect more samples, the empirical risk approaches the true expected risk.

  • Intuition: with enough data and a well-chosen hypothesis class, minimizing empirical risk approximates minimizing true risk.
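
This convergence is easy to see numerically. The sketch below uses a hypothetical setup (data y = 2x + noise, a fixed candidate θ = 1, squared loss), where the true risk works out to E[(x + ε)²] = Var(x) + Var(ε) = 2:

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 1.0  # a fixed (suboptimal) candidate parameter

def empirical_risk(n):
    # draw n samples from y = 2x + eps and average the squared loss of theta
    x = rng.normal(size=n)
    eps = rng.normal(size=n)
    y = 2 * x + eps
    return np.mean((theta * x - y) ** 2)

# true risk: E[(theta*x - y)^2] = E[(x + eps)^2] = 1 + 1 = 2
for n in (10, 1_000, 100_000):
    print(n, empirical_risk(n))  # drifts toward 2 as n grows
```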

Notes

  • Hypothesis class: restricts model complexity and helps generalization.
  • Regularized ERM (Structural Risk Minimization):

        min_θ  R̂_n(θ) + λ Ω(θ)

    where Ω(θ) penalizes complexity (e.g., ‖θ‖₂² or ‖θ‖₁) and λ ≥ 0 controls the trade-off.
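
As a concrete sketch, regularized ERM with an L2 penalty and squared loss is ridge regression, which has a closed-form minimizer (the linear model here is an assumption for illustration):

```python
import numpy as np

def regularized_risk(theta, X, y, lam):
    # empirical risk + lam * Omega(theta), with Omega = squared L2 norm
    return np.mean((X @ theta - y) ** 2) + lam * np.dot(theta, theta)

def ridge_fit(X, y, lam):
    # setting the gradient to zero gives theta = (X^T X + n*lam*I)^{-1} X^T y
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)
```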

  • Common losses: MSE for regression, cross-entropy for classification, hinge loss for SVMs.
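
The three losses written as empirical-risk averages (a sketch; the conventions are assumptions: predicted probabilities for cross-entropy, raw scores and ±1 labels for hinge):

```python
import numpy as np

def mse(y_pred, y):
    # squared error averaged over the sample (regression)
    return np.mean((y_pred - y) ** 2)

def binary_cross_entropy(p, y):
    # p = predicted P(y = 1); y in {0, 1}; clipped for numerical stability
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def hinge(scores, y):
    # y in {-1, +1}; zero loss once the margin y * score reaches 1 (SVMs)
    return np.mean(np.maximum(0.0, 1 - y * scores))
```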

Example

  • Linear regression: f_θ(x) = θᵀx with squared loss ℓ(f_θ(x), y) = (θᵀx − y)²
  • ERM → minimize (1/n) Σ_{i=1}^n (θᵀx_i − y_i)²
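
Putting it together: for squared loss, the ERM problem is least squares and can be solved directly. A sketch on synthetic data (the generating parameters are assumptions for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 2
X = rng.normal(size=(n, d))
theta_true = np.array([3.0, -1.0])
y = X @ theta_true + 0.1 * rng.normal(size=n)  # noisy linear data

# minimize (1/n) * sum_i (theta^T x_i - y_i)^2 via least squares
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_hat)  # should land close to theta_true = [3, -1]
```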