Latent Diffusion Model (LDM)

How is this different from your average diffusion model? Latent diffusion is the standard now, so it's worth knowing exactly what it changes.

The idea is quite simple: run the image through an encoder, do diffusion in latent space, and then decode the result.

  • The point is that diffusion is a very expensive, many-step process, while encoding / decoding is a single fast pass each way (rough arithmetic below)
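Back-of-the-envelope on why this helps, assuming the usual 8× spatial downsampling to a 4-channel latent (my numbers, not the slides'):

```python
# Illustrative arithmetic: per-step tensor size for pixel-space vs latent-space diffusion.
# Assumes a 512x512 RGB image and an 8x-downsampling VAE with 4 latent channels.
pixel_elems  = 512 * 512 * 3          # 786,432 values the denoiser sees per step in pixel space
latent_elems = (512 // 8) ** 2 * 4    # 64*64*4 = 16,384 values per step in latent space

print(pixel_elems / latent_elems)     # 48.0 -> each denoising step touches ~48x less data
```

And you pay the encoder/decoder cost once per image, not once per denoising step.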

The paper High-Resolution Image Synthesis with Latent Diffusion Models (Rombach et al., 2022) is what you should look at.

Walkthrough (CS231n 2025 Lec 14)

Why

Naive diffusion doesn’t scale to high-resolution data — every sampling step runs a big network on full-resolution pixels, and you need ~30–50 steps. Compress first, diffuse in the compact latent space, decompress at the end.

Architecture

  1. Encoder + decoder (CNNs with attention) compress images x → latents z. Common setting: 8× spatial downsampling with 4 latent channels, so a 512×512×3 image becomes a 64×64×4 latent.
  2. Diffusion model trained to denoise latents (encoder is frozen during diffusion training).
  3. At inference: sample random latent → iteratively apply diffusion to denoise → run decoder to get image (see the sketch right after this list).
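A minimal sketch of the whole thing in PyTorch-ish code. Here `vae` and `eps_model` are assumed to exist already (frozen autoencoder + latent denoiser); the DDPM schedule and update are my own fill-in, not the lecture's code:

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # standard DDPM linear schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)     # cumulative product, \bar{alpha}_t

def training_step(vae, eps_model, x):
    """One denoising step on latents; the VAE encoder stays frozen."""
    with torch.no_grad():
        z0 = vae.encode(x)                          # image (B,3,H,W) -> latent (B,4,H/8,W/8)
    t = torch.randint(0, T, (z0.shape[0],))
    noise = torch.randn_like(z0)
    a = alphas_bar[t].view(-1, 1, 1, 1)
    zt = a.sqrt() * z0 + (1 - a).sqrt() * noise     # forward diffusion q(z_t | z_0)
    return F.mse_loss(eps_model(zt, t), noise)      # predict the added noise

@torch.no_grad()
def sample(vae, eps_model, shape):
    """Start from noise in latent space, denoise step by step, decode once at the end."""
    z = torch.randn(shape)
    for t in reversed(range(T)):
        a, a_bar = 1.0 - betas[t], alphas_bar[t]
        eps = eps_model(z, torch.full((shape[0],), t))
        z = (z - (1 - a) / (1 - a_bar).sqrt() * eps) / a.sqrt()   # DDPM posterior mean
        if t > 0:
            z = z + betas[t].sqrt() * torch.randn_like(z)         # add sampling noise
    return vae.decode(z)                                          # single decoder pass to pixels
```

Everything pixel-sized happens exactly twice (encode, decode); the loop only ever touches the small latent.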

Training the encoder + decoder = VAE + GAN

This is the subtle part. The encoder/decoder is trained as:

  • A VAE — typically with a very small KL prior weight (so latents stay informative, not pushed all the way to the standard Gaussian prior N(0, I)).
  • Plus a discriminator (GAN loss) — VAE alone gives blurry decoder outputs; adding a discriminator on the reconstructed image sharpens them.

Modern LDM pipelines use VAE + GAN + diffusion together: VAE for the encoder/decoder structure, GAN for sharp reconstructions, diffusion for the latent generative model.
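Roughly what that combined autoencoder objective looks like, as a sketch. `encoder`, `decoder`, `discriminator` are assumed modules; the real recipe in the LDM paper also has a perceptual (LPIPS) term and an adaptive GAN weight, so treat this as the shape of the loss, not the exact one:

```python
import torch
import torch.nn.functional as F

def autoencoder_loss(encoder, decoder, discriminator, x, kl_weight=1e-6, gan_weight=0.5):
    """VAE-style reconstruction with a very small KL penalty plus an adversarial term."""
    mu, logvar = encoder(x)                                # encoder predicts a Gaussian over latents
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
    x_rec = decoder(z)

    rec = F.l1_loss(x_rec, x)                              # pixel reconstruction (paper adds LPIPS too)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # KL(q(z|x) || N(0, I))
    gan = -discriminator(x_rec).mean()                     # fool the discriminator -> sharper outputs

    # Tiny kl_weight keeps latents informative instead of collapsing toward the N(0, I) prior.
    return rec + kl_weight * kl + gan_weight * gan
```

The discriminator itself is trained separately, real vs reconstructed images; only the generator-side term shows up here.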

Diffusion Transformer (DiT) backbone

Modern LDMs (FLUX, SD3, etc.) use Transformer-based diffusion backbones rather than U-Nets. Conditioning (timestep t, text embeddings) is injected via predicted scale/shift (adaLN-Zero) or cross-attention. See Diffusion Model for the full DiT walkthrough.
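A sketch of the adaLN-Zero route, assuming a single conditioning vector c (timestep embedding plus pooled text embedding, same width as the tokens). Zero-initializing the modulation projection is the "Zero" part: every block starts out as the identity.

```python
import torch
import torch.nn as nn

class AdaLNZeroBlock(nn.Module):
    """One DiT-style block: LayerNorm whose shift/scale/gate are predicted from the conditioning."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Predict (shift, scale, gate) for both sub-layers from the conditioning vector c.
        self.to_modulation = nn.Linear(dim, 6 * dim)
        nn.init.zeros_(self.to_modulation.weight)   # "Zero": block starts as an identity mapping
        nn.init.zeros_(self.to_modulation.bias)

    def forward(self, x, c):                        # x: (B, N, D) latent tokens, c: (B, D) conditioning
        shift1, scale1, gate1, shift2, scale2, gate2 = self.to_modulation(c).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
        x = x + gate1.unsqueeze(1) * self.attn(h, h, h)[0]
        h = self.norm2(x) * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
        x = x + gate2.unsqueeze(1) * self.mlp(h)
        return x
```

Cross-attention is the alternative: keep the text embeddings as a sequence and let the latent tokens attend into them instead of squashing everything into one vector.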

Example: FLUX.1 [dev]

  • Text encoder: T5 + CLIP
  • VAE: 8× downsampling
  • Diffusion: 12B-parameter DiT, patchify → tokens
  • image ↔ latents at 1/8 the spatial resolution (via the 8× VAE above)
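For actually running it, the Hugging Face diffusers FluxPipeline wraps the whole stack (text encoders, DiT, VAE decode) behind one call. The model id and arguments below are what I believe the diffusers API looks like, so treat them as an assumption and double-check against the docs:

```python
import torch
from diffusers import FluxPipeline

# FLUX.1 [dev] weights are gated on the Hub; you need to accept the license first.
pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16)
pipe.to("cuda")

image = pipe(
    "a photo of a corgi surfing a wave",
    height=1024,
    width=1024,
    guidance_scale=3.5,        # [dev] is guidance-distilled, so this is the distilled guidance knob
    num_inference_steps=50,
).images[0]
image.save("corgi.png")
```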

Source

CS231n 2025 Lec 14 slides ~85–102 (LDM pipeline diagram, encoder/decoder VAE training with small KL weight, blurry-decoder problem and discriminator fix, DiT injection of conditioning, FLUX.1 numbers).