High-Resolution Image Synthesis with Latent Diffusion Models

By running the diffusion process in a compressed latent space rather than in pixel space, we can greatly reduce the computational load of training and sampling.
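A rough back-of-the-envelope comparison illustrates the savings. The shapes below are assumptions for illustration: a 256×256 RGB image versus a latent with downsampling factor f=8 and 4 channels, one configuration used in the LDM paper.

```python
# Size of a pixel-space sample vs. an assumed f=8, 4-channel latent.
pixel_dims = 256 * 256 * 3   # 196,608 values per image
latent_dims = 32 * 32 * 4    # 4,096 values per latent
reduction = pixel_dims / latent_dims
print(reduction)  # 48.0
```

So each diffusion step operates on roughly 48× fewer values than it would in pixel space.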

Two stage training:

  1. Train an autoencoder that encodes images into a compact latent space
  2. Train a diffusion model to denoise in that latent space
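The two stages above can be sketched as follows. This is a toy illustration, not the paper's method: stage 1 uses a closed-form linear autoencoder (PCA via SVD) instead of a KL/VQ-regularized autoencoder, and stage 2 trains a linear noise predictor at a single noise level instead of a time-conditioned U-Net. All shapes and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Stage 1: fit an autoencoder on toy "images" (64-dim vectors -> 8-dim latents). ---
X = rng.normal(size=(500, 64))
_, _, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:8].T                              # (64, 8) linear encoder weights


def encode(x):
    return x @ W                          # pixels -> latent


def decode(z):
    return z @ W.T                        # latent -> pixels


# --- Stage 2: train a denoiser to predict the noise added to latents. ---
Z = encode(X)                             # (500, 8) latent codes
A = np.zeros((8, 8))                      # toy linear noise predictor
lr = 0.01
for step in range(200):
    eps = rng.normal(size=Z.shape)        # sample fresh noise each step
    z_noisy = Z + eps                     # single fixed noise level, for brevity
    pred = z_noisy @ A                    # predicted noise
    grad = z_noisy.T @ (pred - eps) / len(Z)
    A -= lr * grad                        # gradient step on the MSE loss
loss = np.mean((z_noisy @ A - eps) ** 2)
print(loss)                               # well below the no-predictor baseline of 1.0
```

The key structural point survives the simplification: the autoencoder is frozen after stage 1, and the diffusion objective (noise prediction under MSE) is optimized purely on latent codes.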

From the paper: “Being likelihood-based models, they do not exhibit mode-collapse and training instabilities as GANs and, by heavily exploiting parameter sharing, they can model highly complex distributions of natural images without involving billions of parameters as in AR models”