Evidence Lower Bound (ELBO)
Used in VAEs. The derivation draws on several concepts: the marginal likelihood, Jensen's inequality, and the KL divergence.
Why ELBO?
The ELBO is useful because it provides a tractable lower bound (a worst-case guarantee) on the log-likelihood of a model distribution (e.g. $p_\theta(x)$) fit to a set of data.
Deriving the ELBO
Start with the intractable log-likelihood:
$$\log p_\theta(x) = \log \int p_\theta(x, z)\, dz$$
Insert any distribution $q(z \mid x)$:
$$\log p_\theta(x) = \log \int q(z \mid x)\, \frac{p_\theta(x, z)}{q(z \mid x)}\, dz$$
Interpret the integral as an expectation:
$$\log p_\theta(x) = \log \mathbb{E}_{z \sim q(z \mid x)}\!\left[\frac{p_\theta(x, z)}{q(z \mid x)}\right]$$
Apply Jensen's Inequality ($\log \mathbb{E}[X] \ge \mathbb{E}[\log X]$):
$$\log p_\theta(x) \ge \mathbb{E}_{q}\!\left[\log \frac{p_\theta(x, z)}{q(z \mid x)}\right]$$
Define the ELBO:
$$\mathrm{ELBO}(\theta, q) = \mathbb{E}_{q}\!\left[\log \frac{p_\theta(x, z)}{q(z \mid x)}\right]$$
Expand $p_\theta(x, z) = p_\theta(x \mid z)\, p(z)$:
$$\mathrm{ELBO} = \mathbb{E}_{q}\left[\log p_\theta(x \mid z)\right] + \mathbb{E}_{q}\left[\log p(z)\right] - \mathbb{E}_{q}\left[\log q(z \mid x)\right]$$
The last two terms form a KL divergence:
$$\mathbb{E}_{q}\left[\log p(z)\right] - \mathbb{E}_{q}\left[\log q(z \mid x)\right] = -\,\mathrm{KL}\!\left(q(z \mid x)\,\|\,p(z)\right)$$
Final ELBO expression:
$$\mathrm{ELBO} = \mathbb{E}_{q}\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\!\left(q(z \mid x)\,\|\,p(z)\right)$$
**In VAEs, we use $q = q_\phi(z \mid x)$, an encoder network with parameters $\phi$.**
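The two ELBO forms above (expectation of the joint-over-$q$ ratio, and reconstruction minus KL) should give the same number. A minimal numeric sanity check on a hypothetical two-state latent (all probabilities here are illustrative, not from any real model):

```python
import numpy as np

# Toy discrete model: latent z ∈ {0, 1}, one fixed observation x.
p_z = np.array([0.4, 0.6])            # prior p(z)
p_x_given_z = np.array([0.7, 0.2])    # likelihood p(x | z)
q_z = np.array([0.5, 0.5])            # arbitrary approximate posterior q(z | x)

p_xz = p_x_given_z * p_z              # joint p(x, z)

# Form 1: E_q[log p(x, z) / q(z)]
elbo_1 = np.sum(q_z * np.log(p_xz / q_z))

# Form 2: E_q[log p(x | z)] - KL(q(z|x) || p(z))
kl_q_prior = np.sum(q_z * np.log(q_z / p_z))
elbo_2 = np.sum(q_z * np.log(p_x_given_z)) - kl_q_prior

assert np.isclose(elbo_1, elbo_2)     # the two forms agree
```

The second form is the one actually optimized in VAEs, since the KL term has a closed form for Gaussian $q_\phi$ and prior.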
Why is this a lower bound?
There is an exact identity:
$$\log p_\theta(x) = \mathrm{ELBO} + \mathrm{KL}\!\left(q(z \mid x)\,\|\,p_\theta(z \mid x)\right)$$
Since the KL divergence is always nonnegative:
$$\mathrm{KL}\!\left(q(z \mid x)\,\|\,p_\theta(z \mid x)\right) \ge 0,$$
we get:
$$\log p_\theta(x) \ge \mathrm{ELBO}.$$
The gap is exactly:
$$\log p_\theta(x) - \mathrm{ELBO} = \mathrm{KL}\!\left(q(z \mid x)\,\|\,p_\theta(z \mid x)\right)$$
Interpretation:
- ELBO is never larger than the true log-likelihood.
- Maximizing the ELBO pushes $q(z \mid x)$ toward the true posterior $p_\theta(z \mid x)$.
- As $q$ improves, the KL gap shrinks.
- If $q(z \mid x) = p_\theta(z \mid x)$, the ELBO is exactly equal to $\log p_\theta(x)$.
Thus the ELBO serves as a tractable surrogate objective for the intractable log-likelihood.
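Both the exact identity and the nonnegative gap can be checked numerically on the same kind of hypothetical discrete toy model (numbers illustrative):

```python
import numpy as np

# Toy discrete model: latent z ∈ {0, 1}, one fixed observation x.
p_z = np.array([0.4, 0.6])            # prior p(z)
p_x_given_z = np.array([0.7, 0.2])    # likelihood p(x | z)
q_z = np.array([0.5, 0.5])            # approximate posterior q(z | x)

p_xz = p_x_given_z * p_z              # joint p(x, z)
log_px = np.log(p_xz.sum())           # exact log-likelihood log p(x)
posterior = p_xz / p_xz.sum()         # true posterior p(z | x)

elbo = np.sum(q_z * np.log(p_xz / q_z))
gap = np.sum(q_z * np.log(q_z / posterior))   # KL(q || p(z|x))

assert np.isclose(log_px, elbo + gap)  # exact identity
assert gap >= 0                        # hence ELBO <= log p(x)
```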
Alternative derivation (CS231n 2025 Lec 13)
CS231n derives the ELBO without invoking Jensen, using pure algebra on Bayes' rule. Start from
$$\log p_\theta(x) = \log \frac{p_\theta(x \mid z)\, p(z)}{p_\theta(z \mid x)}$$
Multiply top and bottom by $q_\phi(z \mid x)$:
$$\log p_\theta(x) = \log \left[ \frac{p_\theta(x \mid z)\, p(z)}{p_\theta(z \mid x)} \cdot \frac{q_\phi(z \mid x)}{q_\phi(z \mid x)} \right]$$
$\log p_\theta(x)$ doesn't depend on $z$, so we can wrap the whole RHS in $\mathbb{E}_{z \sim q_\phi(z \mid x)}$ without changing anything:
$$\log p_\theta(x) = \mathbb{E}_{z}\left[\log p_\theta(x \mid z)\right] - \mathbb{E}_{z}\left[\log \frac{q_\phi(z \mid x)}{p(z)}\right] + \mathbb{E}_{z}\left[\log \frac{q_\phi(z \mid x)}{p_\theta(z \mid x)}\right]$$
The last term is a KL, so it's $\ge 0$, and we can't compute it (it depends on the intractable true posterior $p_\theta(z \mid x)$). Drop it to get a lower bound:
$$\log p_\theta(x) \ge \mathbb{E}_{z}\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right) = \mathrm{ELBO}$$
This is the VAE training objective — both terms are tractable (closed-form KL for Gaussians, Monte-Carlo reconstruction via the reparametrization trick).
Equivalent framing to the Jensen derivation above, but this one makes the “gap = KL to the true posterior” identity visible on a single line.
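A sketch of computing both tractable terms for one data point, assuming a Gaussian encoder, a standard-normal prior, and a unit-variance Gaussian decoder whose mean is $z$ (toy numpy arrays stand in for the encoder/decoder networks; all names and values are illustrative, not CS231n's code):

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=4)                       # one data point
mu = np.full(4, 0.5)                         # encoder output: mean of q_phi(z|x)
log_var = np.full(4, -1.0)                   # encoder output: log-variance

# Reparametrization trick: z = mu + sigma * eps, eps ~ N(0, I),
# so gradients can flow through the sample.
eps = rng.normal(size=4)
z = mu + np.exp(0.5 * log_var) * eps

# Closed-form KL( N(mu, sigma^2) || N(0, I) ), summed over dimensions.
kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

# One-sample Monte-Carlo reconstruction term: log p_theta(x|z) up to a
# constant, for a unit-variance Gaussian decoder with mean z.
recon = -0.5 * np.sum((x - z) ** 2)

elbo_estimate = recon - kl                   # maximize this during training
assert kl >= 0
```

In a real VAE, `mu` and `log_var` come from the encoder network and the decoder maps `z` back to data space; training maximizes the ELBO (equivalently, minimizes its negative) by gradient descent on $\theta, \phi$.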
Source
CS231n 2025 Lec 13 slides ~92–102 (Bayes → multiply by $q_\phi(z \mid x)/q_\phi(z \mid x)$ → split into three log terms → wrap in expectation → identify two KLs → drop the posterior KL to get ELBO).