Autoregressive Model (AR)
Autoregressive models predict the next token in a sequence given the previous ones:

$$p(x_t \mid x_1, \dots, x_{t-1})$$
Examples:
- PixelRNN / PixelCNN (image)
- GPT, Transformer decoders (text)
- WaveNet (audio)
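All of the models above share the same generation loop: sample one token, feed it back in as context, repeat. A minimal sketch of that loop, using a made-up fixed table for $p(x_t \mid x_{t-1})$ over two tokens (the table and function names are illustrative, not from the source):

```python
import random

# Toy conditional distribution p(next | previous) over tokens {a, b}.
# A real model (PixelCNN, GPT, WaveNet) would compute this with a
# neural network conditioned on the full prefix.
cond = {
    "a": {"a": 0.2, "b": 0.8},
    "b": {"a": 0.7, "b": 0.3},
}

def sample_next(prev, rng):
    # Draw the next token from p(x_t | x_{t-1}).
    tokens, probs = zip(*cond[prev].items())
    return rng.choices(tokens, weights=probs, k=1)[0]

def generate(start, length, seed=0):
    # Autoregressive generation: each sampled token becomes context
    # for the next prediction, left to right.
    rng = random.Random(seed)
    seq = [start]
    for _ in range(length - 1):
        seq.append(sample_next(seq[-1], rng))
    return "".join(seq)
```

The loop is inherently sequential, which is why inference cost comes up again below for images.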
And it’s a term that seems to pop up everywhere. But I think all it really does is let you predict (regress) future data from past data. So isn’t this just a simple RNN?
They also mention this in the Neural Autoregressive Flow paper.
CS294
Assuming a fully expressive Bayes net structure, any joint distribution can be written as a product of conditionals
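The chain-rule claim can be checked numerically on a tiny example. A sketch with an assumed joint over two binary variables (the numbers are arbitrary, just for illustration):

```python
# Toy joint distribution over two binary variables (x1, x2).
joint = {
    (0, 0): 0.1, (0, 1): 0.3,
    (1, 0): 0.2, (1, 1): 0.4,
}

# Marginal p(x1) and conditional p(x2 | x1), derived from the joint.
p_x1 = {v: sum(p for (a, _), p in joint.items() if a == v) for v in (0, 1)}
p_x2_given_x1 = {(a, b): joint[(a, b)] / p_x1[a] for (a, b) in joint}

# Chain rule: p(x1, x2) = p(x1) * p(x2 | x1) holds for every outcome.
for (a, b), p in joint.items():
    assert abs(p - p_x1[a] * p_x2_given_x1[(a, b)]) < 1e-12
```

With a fully expressive structure the conditionals can represent any joint exactly; the modeling choice is how to parameterize them.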
Walkthrough (CS231n 2025 Lec 13)
Autoregressive is the tractable-density branch of the generative taxonomy — you can actually compute $p(x)$, so you can train directly via MLE:

$$\max_\theta \sum_i \log p_\theta(x^{(i)})$$

where $p_\theta$ is a neural network that parameterizes the density. The chain rule makes the joint factor into tokenwise conditionals:

$$p(x) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})$$
Trained by predicting the next token — LLMs are autoregressive (RNNs or causally masked Transformer decoders).
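Concretely, the MLE objective decomposes into per-token log conditionals. A sketch using a toy bigram model as a stand-in for the neural network $p_\theta$ (corpus and names are made up for illustration):

```python
import math
from collections import Counter

# Toy corpus; a bigram model is the simplest autoregressive model:
# p(x_t | x_{<t}) is approximated by p(x_t | x_{t-1}).
corpus = "abababba"
pairs = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])

def cond_prob(nxt, prev):
    # p(next token | previous token), estimated by counting.
    return pairs[(prev, nxt)] / unigrams[prev]

def log_likelihood(seq):
    # Chain rule: log p(x) = sum_t log p(x_t | x_{t-1}).
    return sum(math.log(cond_prob(n, p)) for p, n in zip(seq, seq[1:]))

# MLE training would maximize this quantity summed over the dataset;
# here we just evaluate it on one sequence.
ll = log_likelihood("abab")
```

An LLM does the same thing with a softmax over a large vocabulary and the full prefix as context; minimizing next-token cross-entropy is exactly maximizing this log-likelihood.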
Autoregressive models of images — PixelCNN
Treat an image as a 1D sequence of subpixels in raster / scanline order (top-left to bottom-right, R → G → B within each pixel). Each subpixel is an 8-bit integer, so the per-step prediction is a 256-way classification (softmax over pixel values) — this gives you $p(x)$ exactly.
Problem: 1024×1024 RGB = ~3M sequential subpixel predictions per image. Sequential generation at inference is brutal. Modern AR image models (VQ-VAE + AR, MaskGIT) fix this by modeling tiles / discrete tokens instead of individual subpixels.
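The cost and the ordering are easy to make concrete. A sketch of the raster-order indexing and the subpixel count (the helper name is mine, assuming row-major order with channels innermost, as in the scanline description above):

```python
# Raster/scanline order: top-left to bottom-right, R -> G -> B within
# each pixel; each subpixel is an 8-bit integer (256-way softmax).
H, W, C = 1024, 1024, 3

# Number of sequential prediction steps to generate one image.
num_subpixels = H * W * C
print(num_subpixels)  # 3145728, i.e. ~3M softmax steps at inference

def raster_index(row, col, channel):
    # Position of subpixel (row, col, channel) in the flattened
    # sequence, assuming row-major order with channels innermost.
    return (row * W + col) * C + channel
```

Token-based models cut the sequence length by orders of magnitude: a 32×32 grid of VQ-VAE codes is ~1K autoregressive steps instead of ~3M.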
Source
CS231n 2025 Lec 13 slides ~48–59 (MLE objective, chain rule, LLM autoregressive note, PixelRNN/CNN scanline + 256-way softmax, 3M-subpixel problem).