Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion

Diffusion Forcing trains a diffusion model to denoise a set of tokens, where each token has its own independent noise level.

Diffusion Forcing = Teacher Forcing + Diffusion Model
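
To make "independent per-token noise levels" concrete, here is a minimal NumPy sketch of the noising step only. It is illustrative and not the authors' implementation: the function name `noise_tokens`, the linear beta schedule, uniform sampling of the noise level, and the standard DDPM-style forward process are all assumptions not stated above.

```python
import numpy as np

def noise_tokens(x, alpha_bar, rng):
    """Noise each token of x (shape [T, D]) at its own noise level.

    Hypothetical helper: alpha_bar[k] is the cumulative signal-retention
    coefficient at noise level k (assumed DDPM-style forward process).
    """
    T = x.shape[0]
    K = alpha_bar.shape[0]
    # One noise level per token, drawn independently (uniform is an assumption).
    k = rng.integers(0, K, size=T)
    eps = rng.standard_normal(x.shape)
    a = alpha_bar[k][:, None]  # shape [T, 1], broadcasts over the feature dim
    x_noisy = np.sqrt(a) * x + np.sqrt(1.0 - a) * eps
    return x_noisy, eps, k

rng = np.random.default_rng(0)
K = 10                                  # number of noise levels (assumed)
betas = np.linspace(1e-4, 0.2, K)       # linear schedule (assumed)
alpha_bar = np.cumprod(1.0 - betas)

x = rng.standard_normal((6, 4))         # toy sequence: 6 tokens, 4 features each
x_noisy, eps, k = noise_tokens(x, alpha_bar, rng)
print(k)                                # per-token noise levels
```

In a training loop, a denoiser would receive `x_noisy` together with the per-token levels `k` and be asked to recover `eps` (or the clean tokens). Full-sequence diffusion would instead share a single `k` across all tokens.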