Generative Model

Autoregressive Model (AR)

Autoregressive models predict the next token in a sequence given the previous ones, factorizing the joint distribution as

$$p(x) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})$$

Examples:

  • PixelRNN / PixelCNN (image)
  • GPT, Transformer decoders (text)
  • WaveNet (audio)
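
A minimal sketch of what "predict the next token given the previous ones" looks like at sampling time (my own illustration, not from any of the sources below; `model` is a hypothetical network that maps a prefix of token ids to next-token logits):

```python
import torch

@torch.no_grad()
def ar_sample(model, start_token: int, length: int) -> list[int]:
    """Ancestral sampling: draw x_t from p(x_t | x_1, ..., x_{t-1}) one step at a time."""
    tokens = [start_token]
    for _ in range(length - 1):
        prefix = torch.tensor(tokens).unsqueeze(0)         # shape (1, t): everything generated so far
        logits = model(prefix)[0, -1]                       # next-token logits given the prefix
        probs = torch.softmax(logits, dim=-1)
        tokens.append(torch.multinomial(probs, 1).item())   # sample, then feed it back in
    return tokens
```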

It’s a term that seems to pop up everywhere, but as far as I can tell, all it really does is let you predict (regress) future data from past data. So isn’t this just a simple RNN?

This also comes up in the Neural Autoregressive Flow paper.

CS294

Assuming a fully expressive Bayes net structure, any joint distribution can be written as a product of conditionals.
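
For example (my expansion of that statement), with three variables: $p(x_1, x_2, x_3) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2)$; a fully connected ordering like this can represent any joint.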

Walkthrough (CS231n 2025 Lec 13)

Autoregressive models are the tractable-density branch of the generative taxonomy: you can actually compute $p_\theta(x)$, so you can train directly by maximum likelihood (MLE):

$$\theta^* = \arg\max_\theta \sum_i \log p_\theta\big(x^{(i)}\big)$$

where $p_\theta$ is a neural network that parameterizes the density. The chain rule factors the joint into tokenwise conditionals:

$$p_\theta(x) = \prod_{t=1}^{T} p_\theta(x_t \mid x_1, \ldots, x_{t-1})$$

Trained by predicting the next token — LLMs are autoregressive (RNN or masked Transformer).
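
A sketch of that training objective in code (my own, assuming a hypothetical `model` that maps a token sequence to per-position next-token logits):

```python
import torch
import torch.nn.functional as F

def ar_nll(model, tokens: torch.Tensor) -> torch.Tensor:
    """Mean next-token negative log-likelihood for a (batch, T) tensor of token ids."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict x_t from x_<t
    logits = model(inputs)                            # (batch, T-1, vocab)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),          # flatten batch and positions
        targets.reshape(-1),
    )                                                 # = -mean log p_theta(x_t | x_<t)
```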

Autoregressive models of images — PixelCNN

Treat an image as a 1D sequence of subpixels in raster / scanline order (top-left to bottom-right, R → G → B within each pixel). Each subpixel is an 8-bit integer, so the per-step prediction is a 256-way classification (softmax over pixel values), which gives you $p(x_i \mid x_1, \ldots, x_{i-1})$ exactly.
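
To make the ordering concrete, here is a small sketch (my own): a row-major reshape of an (H, W, 3) array is exactly raster order with R → G → B within each pixel.

```python
import numpy as np

def to_subpixel_sequence(img: np.ndarray) -> np.ndarray:
    """Flatten an (H, W, 3) uint8 image into its raster-order subpixel sequence."""
    assert img.dtype == np.uint8 and img.ndim == 3 and img.shape[2] == 3
    return img.reshape(-1)   # row-major reshape = scanline order, R -> G -> B per pixel

seq = to_subpixel_sequence(np.zeros((1024, 1024, 3), dtype=np.uint8))
print(seq.size)              # 3145728, the ~3M sequential steps mentioned below
```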

Problem: 1024×1024 RGB = ~3M sequential subpixel predictions per image. Sequential generation at inference is brutal. Modern AR image models (VQ-VAE + AR, MaskGIT) fix this by modeling tiles / discrete tokens instead of individual subpixels.
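
For scale (my arithmetic, assuming a hypothetical VQ-VAE that maps each 16×16 patch to one discrete token): a 1024×1024 RGB image is 1024 × 1024 × 3 = 3,145,728 subpixels but only (1024/16)² = 4,096 tokens, i.e. roughly 768× fewer sequential steps.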

Source

CS231n 2025 Lec 13 slides ~48–59 (MLE objective, chain rule, LLM autoregressive note, PixelRNN/CNN scanline + 256-way softmax, 3M-subpixel problem).