Evidence Lower Bound (ELBO)

Used in VAE.

The derivation uses several concepts: the marginal (log-)likelihood, Jensen's inequality, and the KL divergence.

Why ELBO?

The ELBO is useful because it provides a worst-case guarantee (a lower bound) on the log-likelihood of a model distribution (e.g. $p_\theta(x)$) fit to a set of data.

Deriving the ELBO

Start with the intractable log-likelihood:

$$\log p_\theta(x) = \log \int p_\theta(x, z)\, dz$$

Insert any distribution $q(z)$ over the latent variable:

$$\log p_\theta(x) = \log \int q(z)\, \frac{p_\theta(x, z)}{q(z)}\, dz$$

Interpret the integral as an expectation:

$$\log p_\theta(x) = \log \mathbb{E}_{q(z)}\!\left[\frac{p_\theta(x, z)}{q(z)}\right]$$

Apply Jensen's inequality ($\log \mathbb{E}[X] \ge \mathbb{E}[\log X]$, since $\log$ is concave):

$$\log p_\theta(x) \ge \mathbb{E}_{q(z)}\!\left[\log \frac{p_\theta(x, z)}{q(z)}\right]$$

Define the ELBO:

$$\mathcal{L}(q, \theta; x) = \mathbb{E}_{q(z)}\!\left[\log \frac{p_\theta(x, z)}{q(z)}\right]$$

Expand $p_\theta(x, z) = p_\theta(x \mid z)\, p(z)$:

$$\mathcal{L} = \mathbb{E}_{q(z)}\big[\log p_\theta(x \mid z)\big] + \mathbb{E}_{q(z)}\big[\log p(z)\big] - \mathbb{E}_{q(z)}\big[\log q(z)\big]$$

The last two terms form a KL divergence:

$$\mathbb{E}_{q(z)}\big[\log p(z)\big] - \mathbb{E}_{q(z)}\big[\log q(z)\big] = -D_{\mathrm{KL}}\big(q(z) \,\|\, p(z)\big)$$

Final ELBO expression:

$$\mathcal{L} = \mathbb{E}_{q(z)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q(z) \,\|\, p(z)\big)$$

**In VAEs, we use $q(z) = q_\phi(z \mid x)$, the approximate posterior produced by the encoder.**
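A quick numerical sanity check of the Jensen step (an added illustration, not from any source): for any positive random variable $W$, here standing in for the ratio $p_\theta(x, z)/q(z)$ inside the expectation, we should find $\mathbb{E}[\log W] \le \log \mathbb{E}[W]$. A minimal NumPy sketch:

```python
# Sanity check of Jensen's inequality for log: E[log W] <= log E[W].
# W plays the role of the ratio p_theta(x, z) / q(z) inside the expectation.
import numpy as np

rng = np.random.default_rng(0)
w = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)  # positive samples of W

elbo_side = np.mean(np.log(w))      # E[log W]  (the ELBO side of the bound)
loglik_side = np.log(np.mean(w))    # log E[W]  (the log-likelihood side)
print(elbo_side, loglik_side, elbo_side <= loglik_side)  # ~0.0, ~0.5, True
```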

Why is this a lower bound?

There is an exact identity:

$$\log p_\theta(x) = \mathcal{L}(q, \theta; x) + D_{\mathrm{KL}}\big(q(z) \,\|\, p_\theta(z \mid x)\big)$$

Since the KL divergence is always nonnegative:

$$D_{\mathrm{KL}}\big(q(z) \,\|\, p_\theta(z \mid x)\big) \ge 0,$$

we get:

$$\log p_\theta(x) \ge \mathcal{L}(q, \theta; x)$$

The gap is exactly:

$$\log p_\theta(x) - \mathcal{L}(q, \theta; x) = D_{\mathrm{KL}}\big(q(z) \,\|\, p_\theta(z \mid x)\big)$$

Interpretation:

  • ELBO is never larger than the true log-likelihood.
  • Maximizing the ELBO pushes $q(z)$ toward the true posterior $p_\theta(z \mid x)$.
  • As $q$ improves, the KL gap shrinks.
  • If $q(z) = p_\theta(z \mid x)$, the ELBO is exactly equal to $\log p_\theta(x)$.

Thus ELBO is the best computable surrogate for the true log-likelihood.
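To make the gap identity concrete, here is a small numerical sketch (an added illustration with a made-up two-state latent, not from the notes) that verifies $\log p_\theta(x) = \mathcal{L} + D_{\mathrm{KL}}\big(q(z) \,\|\, p_\theta(z \mid x)\big)$ exactly by enumerating a binary latent:

```python
# Verify log p(x) = ELBO + KL(q(z) || p(z|x)) on a toy two-state latent model.
import numpy as np

p_z = np.array([0.3, 0.7])           # prior over a binary latent z
p_x_given_z = np.array([0.9, 0.2])   # likelihood of one fixed observation x

p_xz = p_z * p_x_given_z             # joint p(x, z) for that x
log_px = np.log(p_xz.sum())          # exact marginal log-likelihood
p_z_given_x = p_xz / p_xz.sum()      # exact posterior p(z|x)

q = np.array([0.5, 0.5])             # any approximate posterior q(z)
elbo = np.sum(q * (np.log(p_xz) - np.log(q)))            # E_q[log p(x,z) - log q(z)]
kl_gap = np.sum(q * (np.log(q) - np.log(p_z_given_x)))   # KL(q || p(z|x))

print(np.isclose(log_px, elbo + kl_gap))  # True: the gap is exactly the KL
print(elbo <= log_px)                     # True: the ELBO never exceeds log p(x)
```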

Alternative derivation (CS231n 2025 Lec 13)

CS231n derives the ELBO without invoking Jensen, using pure algebra on Bayes' rule. Start from

$$\log p_\theta(x) = \log \frac{p_\theta(x \mid z)\, p(z)}{p_\theta(z \mid x)}$$

Multiply top and bottom by $q_\phi(z \mid x)$ and split into three log terms:

$$\log p_\theta(x) = \log p_\theta(x \mid z) - \log \frac{q_\phi(z \mid x)}{p(z)} + \log \frac{q_\phi(z \mid x)}{p_\theta(z \mid x)}$$

$\log p_\theta(x)$ doesn't depend on $z$, so we can wrap the whole RHS in $\mathbb{E}_{z \sim q_\phi(z \mid x)}$ without changing anything:

$$\log p_\theta(x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big) + D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)$$

The last term is a KL divergence, so it is $\ge 0$, and we can't compute it (it depends on the intractable true posterior $p_\theta(z \mid x)$). Drop it to get a lower bound:

$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big) = \mathrm{ELBO}(\theta, \phi; x)$$

This is the VAE training objective — both terms are tractable (closed-form KL for Gaussians, Monte-Carlo reconstruction via the reparametrization trick).

Equivalent framing to the Jensen derivation above, but this one makes the “gap = KL to the true posterior” identity visible on a single line.
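A minimal PyTorch-style sketch of this objective, assuming hypothetical `encoder` and `decoder` modules: the encoder returns the mean and log-variance of a diagonal Gaussian $q_\phi(z \mid x)$, and the decoder returns Bernoulli logits for $p_\theta(x \mid z)$. The closed-form Gaussian KL and the reparametrization trick are the standard ones; the module interfaces themselves are assumptions, not the lecture's code:

```python
import torch
import torch.nn.functional as F

def vae_negative_elbo(x, encoder, decoder):
    """One-sample Monte-Carlo estimate of -ELBO for a Gaussian-encoder,
    Bernoulli-decoder VAE. `encoder` and `decoder` are assumed nn.Modules."""
    mu, logvar = encoder(x)                      # parameters of q_phi(z|x)

    # Reparametrization trick: z = mu + sigma * eps with eps ~ N(0, I),
    # so gradients can flow back through mu and logvar.
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * logvar) * eps

    # Reconstruction term: Monte-Carlo estimate of -E_q[log p_theta(x|z)],
    # with the decoder producing one Bernoulli logit per dimension of x.
    logits = decoder(z)
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction="sum")

    # KL(q_phi(z|x) || N(0, I)) in closed form for a diagonal Gaussian.
    kl = -0.5 * torch.sum(1.0 + logvar - mu.pow(2) - logvar.exp())

    return recon + kl  # minimizing this maximizes the ELBO
```

Averaging this over a minibatch and minimizing it with any optimizer trains both networks on the ELBO objective.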

Source

CS231n 2025 Lec 13 slides ~92–102 (Bayes → multiply by $q_\phi(z \mid x)$ → split into three log terms → wrap in expectation → identify two KLs → drop the posterior KL to get ELBO).