Flamingo
Flamingo bolts a vision pathway onto a frozen LLM via two learned components, then trains on interleaved image+text web data — that interleaving is what gives it few-shot, in-context visual reasoning.
Walkthrough (CS231n 2025 Lec 16)
What stays frozen, what is learned
- Frozen: vision encoder (NFNet) and the LM (Chinchilla).
- Learned: (1) Perceiver Resampler, (2) Gated XATTN-DENSE layers inserted between LM blocks (the freeze/train split is sketched right after this list).
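A minimal PyTorch sketch of the freeze/train split, using toy stand-in modules (all names and shapes below are placeholders for illustration, not Flamingo's actual code):

import torch
import torch.nn as nn

# Stand-ins for the real components: NFNet (vision), Chinchilla (LM),
# the Perceiver Resampler, and the Gated XATTN-DENSE layers.
vision_encoder = nn.Linear(3, 8)
lm             = nn.Linear(8, 8)
resampler      = nn.Linear(8, 8)
gated_layers   = nn.Linear(8, 8)

# Freeze the pretrained pathways.
for module in (vision_encoder, lm):
    for p in module.parameters():
        p.requires_grad = False

# Only the two new components receive gradients.
trainable = [p for m in (resampler, gated_layers) for p in m.parameters()]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)  # lr is illustrative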
Perceiver Resampler
A small Transformer with a fixed set of learned query tokens that cross-attends into the variable-length image/video token stream. Output: a fixed number of visual tokens regardless of input resolution or video length. This makes the LM’s input shape predictable.
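A minimal PyTorch sketch of the idea (num_latents=64 matches the paper's output size; everything else is illustrative, and the real resampler also concatenates the latents into the keys/values and interleaves feed-forward blocks):

import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    # A fixed set of learned query tokens cross-attends into a
    # variable-length visual token stream; the output shape is constant.
    def __init__(self, dim=512, num_latents=64, num_heads=8, depth=2):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.layers = nn.ModuleList([
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(depth)
        ])

    def forward(self, visual_tokens):            # (B, N, dim), N varies
        b = visual_tokens.size(0)
        q = self.latents.unsqueeze(0).expand(b, -1, -1)
        for attn in self.layers:
            out, _ = attn(q, visual_tokens, visual_tokens)
            q = q + out                          # residual on the queries
        return q                                 # (B, num_latents, dim)

x = torch.randn(2, 1000, 512)                    # e.g. many video-frame tokens
print(PerceiverResampler()(x).shape)             # torch.Size([2, 64, 512])

Whether the input is one image's patches or thousands of video-frame tokens, the LM downstream always receives the same fixed-size set of visual tokens.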
Gated XATTN-DENSE
One of these layers is inserted before each frozen LM block (the larger Flamingo variants insert them only every few blocks):
y = y + tanh(alpha_xattn) * attention(q=y, kv=x) # learned cross-attn into visual tokens
y = y + tanh(alpha_dense) * ffw(y) # learned dense
y = y + frozen_attention(q=y, kv=y) # original frozen LM self-attn
y = y + frozen_ffw(y) # original frozen LM dense
Both alpha_xattn and alpha_dense are initialized to 0, so tanh(0) = 0 and at step 0 the gated layers are exact identities: the frozen LM's behavior is preserved exactly. Training gradually opens the gates, which both stabilizes training and prevents catastrophic interference with the LM's pretrained behavior.
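A runnable PyTorch rendering of the pseudocode above, as a sketch: frozen_block stands in for the original frozen LM block, and layer norms are omitted.

import torch
import torch.nn as nn

class GatedXAttnDense(nn.Module):
    def __init__(self, dim, num_heads, frozen_block):
        super().__init__()
        self.xattn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffw = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        # Gates start at 0: tanh(0) = 0, so at initialization this layer
        # passes y through unchanged and the frozen LM is reproduced exactly.
        self.alpha_xattn = nn.Parameter(torch.zeros(1))
        self.alpha_dense = nn.Parameter(torch.zeros(1))
        self.frozen_block = frozen_block  # original self-attn + FFW, frozen

    def forward(self, y, x):  # y: (B, T, dim) text, x: (B, N, dim) visual
        attn_out, _ = self.xattn(y, x, x)              # cross-attn into x
        y = y + torch.tanh(self.alpha_xattn) * attn_out
        y = y + torch.tanh(self.alpha_dense) * self.ffw(y)
        return self.frozen_block(y)                    # frozen LM block

layer = GatedXAttnDense(dim=512, num_heads=8, frozen_block=nn.Identity())
y, x = torch.randn(2, 10, 512), torch.randn(2, 64, 512)
assert torch.allclose(layer(y, x), y)  # identity at step 0, as claimed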
Interleaved training
Sequences look like <image>caption<eos><image>caption<eos>... scraped from the web. Attention is masked so each text token attends only to the most recent preceding image, not to all images in the sequence (a sketch of this masking rule follows below). This structural prior is what makes few-shot in-context learning work at inference time: show two (image, label) demonstrations, then a test image, and the model generalizes.
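A sketch of the masking rule in PyTorch (names like text_to_image_idx are illustrative; in Flamingo the mask applies to cross-attention over each image's resampled visual tokens):

import torch

def most_recent_image_mask(text_to_image_idx, num_images):
    # text_to_image_idx[t] = index of the most recent <image> preceding
    # text token t. Returns a (T, num_images) boolean mask, True where
    # cross-attention is allowed.
    image_ids = torch.arange(num_images)                 # (num_images,)
    return text_to_image_idx.unsqueeze(1) == image_ids   # (T, num_images)

# Example with 3 interleaved images: each text token may attend only to
# the single image that most recently preceded it.
idx = torch.tensor([0, 0, 0, 1, 1, 1, 2, 2])  # per-text-token image index
print(most_recent_image_mask(idx, 3).int())   # rows are one-hot

A few-shot prompt like <image>A<eos><image>B<eos><image> reuses exactly this structure, so each demonstration stays paired with its own image.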
Source
CS231n 2025 Lec 16 slides ~121–127 (Perceiver Resampler, Gated XATTN-DENSE pseudocode, alpha-init-0 trick, interleaved training with masked attention).