PixelRNN / PixelCNN
Explicit tractable-density generative models for images. Treat the image as a 1D sequence of subpixels in raster / scanline order and autoregressively predict each next subpixel from all previous ones:

p(x) = ∏_{i=1}^{n} p(x_i | x_1, …, x_{i−1})

Each subpixel is an 8-bit integer, so the per-step model is a 256-way softmax classification. The loss is plain cross-entropy: no variational tricks, no adversarial game, and the likelihood can be evaluated exactly for any image.
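A minimal sketch of that loss, assuming a hypothetical `model` that maps a (B, C, H, W) image to per-subpixel logits of shape (B, 256, C, H, W) (the masking inside `model` is what enforces the autoregressive ordering):

```python
import torch
import torch.nn.functional as F

def nll_per_image(model, images_uint8):
    """Exact negative log-likelihood of each image under the AR model.

    `model` is an assumed autoregressive network: (B, C, H, W) float input ->
    (B, 256, C, H, W) logits, masked so position i only sees positions < i.
    """
    x = images_uint8.float() / 255.0          # network input, scaled to [0, 1]
    targets = images_uint8.long()             # class labels 0..255, shape (B, C, H, W)
    logits = model(x)                         # (B, 256, C, H, W)
    # Plain cross-entropy per subpixel; no lower bound, no discriminator.
    nll = F.cross_entropy(logits, targets, reduction="none")   # (B, C, H, W)
    return nll.flatten(1).sum(dim=1)          # summed over subpixels = exact -log p(image)
```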
Why 256-way softmax instead of predicting a continuous value?
Treating subpixels as categorical captures multimodality — e.g. an edge pixel could plausibly be dark OR bright, but almost never the mid-gray average. A regression loss (L2) would collapse to the blurry mean.
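A toy version of the collapse-to-mean argument (numbers invented for illustration):

```python
import torch

# An edge subpixel that is dark (0) half the time and bright (255) half the time.
samples = torch.tensor([0.0, 255.0, 0.0, 255.0])

# The L2-optimal point prediction is the mean: mid-gray, which never actually occurs.
print(samples.mean())   # tensor(127.5000)

# A 256-way softmax can instead put ~0.5 mass on 0 and ~0.5 on 255, so
# sampling from it produces dark or bright values, never the gray average.
```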
PixelRNN (van den Oord et al., ICML 2016)
Row-by-row generation with an RNN (LSTM variants: Row LSTM, Diagonal BiLSTM). Context for each position comes from the recurrent hidden state. The Row LSTM sweeps down the image one row at a time and captures a triangular region above each pixel; the Diagonal BiLSTM scans along the diagonals in both directions and covers the full available context (everything above and to the left).
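A simplified sketch of the row-to-row recurrence idea (illustrative assumptions throughout: a plain tanh recurrence stands in for the LSTM gates, and the layer sizes are made up; this is not the paper's exact Row LSTM):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RowRecurrence(nn.Module):
    """Toy row-wise recurrence: hidden states for row r see (a) row r-1's
    hidden states via a conv (context from above) and (b) row r's inputs via
    a strictly-left causal conv, so position (r, c) never sees pixel (r, c)."""

    def __init__(self, channels=64, k=3):
        super().__init__()
        self.k = k
        self.state_conv = nn.Conv1d(channels, channels, k, padding=k // 2)
        self.input_conv = nn.Conv1d(channels, channels, k)  # padded manually below

    def forward(self, x):                     # x: (B, C, H, W) input features
        B, C, H, W = x.shape
        h = x.new_zeros(B, C, W)
        rows = []
        for r in range(H):                    # sequential over rows only
            left = F.pad(x[:, :, r], (self.k, 0))      # shift so conv sees columns < c
            left = self.input_conv(left)[:, :, :W]     # strictly-left context in row r
            h = torch.tanh(self.state_conv(h) + left)  # tanh stands in for LSTM gates
            rows.append(h)
        return torch.stack(rows, dim=2)       # (B, C, H, W) hidden states
```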
PixelCNN (van den Oord et al., NeurIPS 2016)
Replace the RNN with masked convolutions so the receptive field at each position only includes already-generated pixels (top-left neighborhood). Training is fully parallel over an image (same as training an autoregressive Transformer) — much faster than the recurrent PixelRNN.
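The standard way to implement this (a common sketch, close in spirit to but not copied from the paper): subclass Conv2d and zero out the kernel weights that would look at the current or future positions. Per-channel ordering within a pixel (R before G before B) is omitted for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv2d(nn.Conv2d):
    """Conv whose kernel is masked to the top-left neighborhood.

    Mask type 'A' (first layer) also hides the center position; type 'B'
    (later layers) allows it, since by then the center feature only carries
    information from earlier pixels. Per-channel RGB masking omitted here.
    """

    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        kH, kW = self.kernel_size
        mask = torch.zeros_like(self.weight)  # (out, in, kH, kW)
        mask[:, :, :kH // 2, :] = 1           # all rows strictly above
        mask[:, :, kH // 2, :kW // 2] = 1     # same row, strictly left of center
        if mask_type == "B":
            mask[:, :, kH // 2, kW // 2] = 1  # center position allowed
        self.register_buffer("mask", mask)

    def forward(self, x):
        # Mask applied functionally, so no "future" weight is ever used.
        return F.conv2d(x, self.weight * self.mask, self.bias,
                        self.stride, self.padding, self.dilation, self.groups)
```

Stacking one type-'A' layer followed by type-'B' layers gives every output a receptive field covering only already-generated pixels, so a whole training image goes through in one parallel forward pass. (This simple single-stack masking has a known blind spot in the receptive field; the NeurIPS 2016 gated variant fixes it with separate vertical and horizontal stacks.)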
Sampling is still sequential: one subpixel at a time, conditioned on all previously sampled ones.
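The corresponding sampling loop (same assumed `model` interface as in the loss sketch above; note the full forward pass per subpixel, which is exactly the cost the next section is about):

```python
import torch

@torch.no_grad()
def sample(model, B=1, C=3, H=32, W=32, device="cpu"):
    img = torch.zeros(B, C, H, W, device=device)
    for r in range(H):                # raster / scanline order
        for c in range(W):
            for ch in range(C):       # R, then G, then B within each pixel
                logits = model(img)[:, :, ch, r, c]            # (B, 256)
                probs = torch.softmax(logits, dim=-1)
                val = torch.multinomial(probs, num_samples=1)  # (B, 1)
                img[:, ch, r, c] = val.squeeze(-1).float() / 255.0
    return (img * 255).round().to(torch.uint8)
```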
Problem: scale
A 1024×1024 RGB image is 3 million subpixels = 3M sequential sampling steps. Even 256×256 is ~200K (256 × 256 × 3 = 196,608). This is why modern autoregressive image models (VQ-VAE + AR, ImageGPT, MaskGIT, Parti) model tiles or discrete tokens instead of raw subpixels, reducing sequence length by 100–1000×.
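Back-of-the-envelope for the token route (the 32×32 grid here is an assumed but typical VQ-VAE configuration, not a quoted figure):

```python
raw_steps   = 256 * 256 * 3      # 196,608 subpixel sampling steps
token_steps = 32 * 32            # 1,024 discrete-token steps
print(raw_steps // token_steps)  # 192x fewer sequential steps
```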
Source
CS231n 2025 Lec 13 slides ~55–60 (PixelRNN/PixelCNN as autoregressive models of images, raster-order subpixels, 256-way softmax, 3M-subpixel scaling problem, foreshadow of tile-based AR).