Text-To-Image Models

These are generally Latent Diffusion Models.

From Wikipedia: Text-to-image models are generally latent diffusion models, which combine a language model, which transforms the input text into a latent representation, and a generative image model, which produces an image conditioned on that representation.
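A toy sketch of that two-part structure, to make the data flow concrete. Nothing here is a real model: encode_text, denoise_step, decode_latent, generate, and LATENT_DIM are hypothetical names, the "language model" is just a hash, and the "denoiser" simply pulls the latent toward the text vector instead of using a learned noise predictor. It only shows the shape of the pipeline: text -> latent representation -> iterative conditioned denoising in latent space -> decode to pixels.

```python
import hashlib
import random

LATENT_DIM = 4  # hypothetical latent size, chosen only for illustration


def encode_text(prompt, dim=LATENT_DIM):
    """Stand-in for the language model: map text to a fixed-size vector in [-1, 1]."""
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    return [digest[i] / 127.5 - 1.0 for i in range(dim)]


def denoise_step(latent, cond, strength=0.5):
    """Stand-in for the learned denoiser: move the latent toward the conditioning vector.

    A real latent diffusion model would predict and remove noise with a trained network.
    """
    return [z + strength * (c - z) for z, c in zip(latent, cond)]


def decode_latent(latent, scale=2):
    """Stand-in for the decoder: expand each latent value into `scale` pixels in 0..255."""
    pixels = []
    for z in latent:
        clamped = max(-1.0, min(1.0, z))
        value = round((clamped + 1.0) * 127.5)
        pixels.extend([value] * scale)
    return pixels


def generate(prompt, steps=20, seed=0):
    """Run the toy pipeline: encode text, denoise a random latent, decode to pixels."""
    rng = random.Random(seed)
    cond = encode_text(prompt)
    latent = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]  # start from pure noise
    for _ in range(steps):
        latent = denoise_step(latent, cond)
    return decode_latent(latent)


if __name__ == "__main__":
    print(generate("a cat on a sofa"))
```

The point of working in the latent space is that the iterative loop runs on a small vector (here 4 values) and only the final decode step produces the larger pixel output (here 8 values).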

Mostly useful as background for World Model / Video Generation Model research.