Flamingo
Flamingo bolts a vision pathway onto a frozen LLM via two learned components, then trains on interleaved image+text web data — that interleaving is what gives it few-shot, in-context visual reasoning.
Walkthrough (CS231n 2025 Lec 16)
What stays frozen, what is learned
- Frozen: vision encoder (NFNet) and the LM (Chinchilla).
- Learned: (1) Perceiver Resampler, (2) Gated XATTN-DENSE layers inserted between LM blocks (the freeze/train split is sketched right after this list).
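A minimal PyTorch sketch of the freeze/train split, using toy stand-in modules (all names and shapes below are placeholders for illustration, not Flamingo's actual code):

import torch
import torch.nn as nn

# Stand-ins for the real components: NFNet (vision), Chinchilla (LM),
# the Perceiver Resampler, and the Gated XATTN-DENSE layers.
vision_encoder = nn.Linear(3, 8)
lm             = nn.Linear(8, 8)
resampler      = nn.Linear(8, 8)
gated_layers   = nn.Linear(8, 8)

# Freeze the pretrained pathways.
for module in (vision_encoder, lm):
    for p in module.parameters():
        p.requires_grad = False

# Only the two new components receive gradients.
trainable = [p for m in (resampler, gated_layers) for p in m.parameters()]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)  # lr is illustrative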
Perceiver Resampler
A small Transformer with a fixed set of learned query tokens that cross-attends into the variable-length image/video token stream. Output: a fixed number of visual tokens regardless of input resolution or video length. This makes the LM’s input shape predictable.
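A minimal PyTorch sketch of the idea (num_latents=64 matches the paper's output size; everything else is illustrative, and the real resampler also concatenates the latents into the keys/values and interleaves feed-forward blocks):

import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    # A fixed set of learned query tokens cross-attends into a
    # variable-length visual token stream; the output shape is constant.
    def __init__(self, dim=512, num_latents=64, num_heads=8, depth=2):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.layers = nn.ModuleList([
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(depth)
        ])

    def forward(self, visual_tokens):            # (B, N, dim), N varies
        b = visual_tokens.size(0)
        q = self.latents.unsqueeze(0).expand(b, -1, -1)
        for attn in self.layers:
            out, _ = attn(q, visual_tokens, visual_tokens)
            q = q + out                          # residual on the queries
        return q                                 # (B, num_latents, dim)

x = torch.randn(2, 1000, 512)                    # e.g. many video-frame tokens
print(PerceiverResampler()(x).shape)             # torch.Size([2, 64, 512])

Whether the input is one image's patches or thousands of video-frame tokens, the LM downstream always receives the same fixed-size set of visual tokens.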
Gated XATTN-DENSE
One of these layers is inserted before each frozen LM block (the larger Flamingo variants insert them only every few blocks):
y = y + tanh(alpha_xattn) * attention(q=y, kv=x) # learned cross-attn into visual tokens
y = y + tanh(alpha_dense) * ffw(y) # learned dense
y = y + frozen_attention(q=y, kv=y) # original frozen LM self-attn
y = y + frozen_ffw(y) # original frozen LM dense
Both alpha_xattn and alpha_dense are initialized to 0, so tanh(0) = 0 and at step 0 the gated layers are exact identities: the frozen LM's behavior is preserved exactly. Training gradually opens the gates, which both stabilizes training and prevents catastrophic interference with the LM's pretrained behavior.
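A runnable PyTorch rendering of the pseudocode above, as a sketch: frozen_block stands in for the original frozen LM block, and layer norms are omitted.

import torch
import torch.nn as nn

class GatedXAttnDense(nn.Module):
    def __init__(self, dim, num_heads, frozen_block):
        super().__init__()
        self.xattn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffw = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        # Gates start at 0: tanh(0) = 0, so at initialization this layer
        # passes y through unchanged and the frozen LM is reproduced exactly.
        self.alpha_xattn = nn.Parameter(torch.zeros(1))
        self.alpha_dense = nn.Parameter(torch.zeros(1))
        self.frozen_block = frozen_block  # original self-attn + FFW, frozen

    def forward(self, y, x):  # y: (B, T, dim) text, x: (B, N, dim) visual
        attn_out, _ = self.xattn(y, x, x)              # cross-attn into x
        y = y + torch.tanh(self.alpha_xattn) * attn_out
        y = y + torch.tanh(self.alpha_dense) * self.ffw(y)
        return self.frozen_block(y)                    # frozen LM block

layer = GatedXAttnDense(dim=512, num_heads=8, frozen_block=nn.Identity())
y, x = torch.randn(2, 10, 512), torch.randn(2, 64, 512)
assert torch.allclose(layer(y, x), y)  # identity at step 0, as claimed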
Interleaved training
Sequences look like <image>caption<eos><image>caption<eos>... scraped from the web. Attention is masked so each text token attends only to the most recent preceding image, not to all images in the sequence (a sketch of this masking rule follows below). This structural prior is what makes few-shot in-context learning work at inference time: show two (image, label) demonstrations, then a test image, and the model generalizes.
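A sketch of the masking rule in PyTorch (names like text_to_image_idx are illustrative; in Flamingo the mask applies to cross-attention over each image's resampled visual tokens):

import torch

def most_recent_image_mask(text_to_image_idx, num_images):
    # text_to_image_idx[t] = index of the most recent <image> preceding
    # text token t. Returns a (T, num_images) boolean mask, True where
    # cross-attention is allowed.
    image_ids = torch.arange(num_images)                 # (num_images,)
    return text_to_image_idx.unsqueeze(1) == image_ids   # (T, num_images)

# Example with 3 interleaved images: each text token may attend only to
# the single image that most recently preceded it.
idx = torch.tensor([0, 0, 0, 1, 1, 1, 2, 2])  # per-text-token image index
print(most_recent_image_mask(idx, 3).int())   # rows are one-hot

A few-shot prompt like <image>A<eos><image>B<eos><image> reuses exactly this structure, so each demonstration stays paired with its own image.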
Source
CS231n 2025 Lec 16 slides ~121–127 (Perceiver Resampler, Gated XATTN-DENSE pseudocode, alpha-init-0 trick, interleaved training with masked attention).