Vision-Language Model (VLM)
VLMs are a class of generative models that take image and text inputs and generate text outputs. They usually solve VQA or captioning-style tasks.
Resources
Models
- CLIP (but not generative, it’s contrastive)
- PaliGemma
- LLaVA
- Prismatic VLM
In robotics, we fine-tune these VLMs into vision-language-action models (VLAs).
There are a few setups:
- In captioning / VQA tasks, the VLM conditions on image tokens and then predicts text tokens autoregressively (e.g., “a dog chasing a ball”).
- In multimodal generation (e.g., image generation with text prompts), the VLM conditions on text tokens and predicts image tokens, which are then decoded back into pixels.
- In bidirectional encoders (e.g., CLIP, ALIGN), the VLM doesn’t “predict” tokens autoregressively at all; instead it aligns text tokens with visual embeddings in a joint latent space.
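A minimal PyTorch sketch of the contrastive setup (the function name, dimensions, and temperature are illustrative choices, not CLIP's released code). Both encoders map into a shared space, and a symmetric cross-entropy pulls matched image-text pairs together while pushing mismatched pairs apart:

import torch
import torch.nn.functional as F

def clip_style_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched (image, text) pairs.
    image_feats, text_feats: (B, D) embeddings; row i of each describes the same pair."""
    # normalize so the dot product is cosine similarity
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    # (B, B) similarity matrix; the diagonal holds the positive pairs
    logits = image_feats @ text_feats.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # classify "which text goes with this image" and the transposed direction
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

loss = clip_style_loss(torch.randn(8, 512), torch.randn(8, 512))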
What does a VLM predict?
LLMs predict tokens, and VLMs also predict tokens. If you’re thinking of models that generate images from text, those are text-to-image models, a separate category.
Walkthrough (CS231n 2025 Lec 16)
LLaVA — the simple recipe
Vision encoder (CLIP ViT, penultimate layer patch tokens — not CLS, since CLS throws away spatial info) → linear bridge → frozen/finetuned LLM (LLaMA). Recipe:
- Initialize with a pretrained LLM and pretrained CLIP.
- Freeze both, train only the linear bridge to align modalities (sketched after the recipe).
- Then finetune the LLM (and optionally the vision encoder) on >100K image+instruction+output samples.
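A minimal sketch of the bridge from that recipe (module name and dimensions are assumptions on my part, not LLaVA's actual code). One linear layer maps CLIP patch tokens into the LLM's embedding space; the projected image tokens are prepended to the text embeddings, and in stage 1 only this projection is trainable:

import torch
import torch.nn as nn

class LinearBridge(nn.Module):
    """Projects frozen CLIP patch tokens into the LLM's token-embedding space."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_tokens):        # (B, num_patches, vision_dim)
        return self.proj(patch_tokens)      # (B, num_patches, llm_dim)

# toy shapes: 2 images, 256 patch tokens each, 32 text tokens
bridge = LinearBridge()
patch_tokens = torch.randn(2, 256, 1024)    # would come from the frozen CLIP ViT (penultimate layer)
text_embeds = torch.randn(2, 32, 4096)      # would come from the frozen LLM's embedding table
image_embeds = bridge(patch_tokens)
llm_inputs = torch.cat([image_embeds, text_embeds], dim=1)   # image tokens first, then text
# stage 1: only bridge.proj gets gradients; the LLM and CLIP stay frozen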
Flamingo — gated cross-attention into a frozen LM
Frozen vision encoder + frozen LM, with two learned bits:
- Perceiver Resampler: down-samples a variable number of image/video tokens to a fixed number — lets the LM see arbitrary visual context length.
- GATED XATTN-DENSE layers, interleaved between the frozen LM blocks:
# gated cross-attention from text (y) into visual tokens (x), then a gated FFW
y = y + tanh(alpha_xattn) * attention(q=y, kv=x)
y = y + tanh(alpha_dense) * ffw(y)
# the original frozen LM block then runs unchanged
y = y + frozen_attention(q=y, kv=y)
y = y + frozen_ffw(y)
Both alpha_xattn and alpha_dense are initialized at 0, so at the start of training the gated layers are no-ops and the frozen LM’s behavior is preserved; training then gradually opens the gates. Training uses interleaved <image>...<eos> sequences with masked attention so each text token attends only to the most recent preceding image. This is what unlocks in-context few-shot learning for vision tasks.
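A PyTorch sketch of one GATED XATTN-DENSE layer matching the pseudocode above (dimensions and module layout are my own simplifications, not Flamingo's code). The point to notice is the zero-initialized gates, which make the layer a no-op around the frozen block at initialization:

import torch
import torch.nn as nn

class GatedXAttnDense(nn.Module):
    """Gated cross-attention + FFW inserted in front of a frozen LM block."""
    def __init__(self, dim=512, n_heads=8):
        super().__init__()
        self.xattn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffw = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # tanh(0) = 0, so both residual branches start as no-ops
        self.alpha_xattn = nn.Parameter(torch.zeros(1))
        self.alpha_dense = nn.Parameter(torch.zeros(1))

    def forward(self, y, x):
        # y: text tokens (B, T, dim); x: visual tokens from the Perceiver Resampler (B, V, dim)
        attn_out, _ = self.xattn(query=y, key=x, value=x, need_weights=False)
        y = y + torch.tanh(self.alpha_xattn) * attn_out
        y = y + torch.tanh(self.alpha_dense) * self.ffw(y)
        return y    # the frozen LM block (self-attention + FFW) then runs on y unchanged

# with the gates at zero the layer is an exact identity
layer = GatedXAttnDense()
y = torch.randn(2, 16, 512)     # text tokens
x = torch.randn(2, 64, 512)     # resampled visual tokens
assert torch.allclose(layer(y, x), y)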
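And a sketch of the Perceiver Resampler idea, again a simplification rather than the paper's exact architecture: a fixed set of learned latent queries cross-attends into however many visual tokens the encoder produced, so the output length is constant regardless of input length:

import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Compresses a variable number of visual tokens into a fixed num_latents outputs."""
    def __init__(self, dim=512, num_latents=64, n_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, visual_tokens):       # (B, N, dim); N varies with image count / video length
        q = self.latents.unsqueeze(0).expand(visual_tokens.size(0), -1, -1)
        out, _ = self.attn(query=q, key=visual_tokens, value=visual_tokens, need_weights=False)
        return out                           # (B, num_latents, dim), regardless of N

resampler = PerceiverResampler()
print(resampler(torch.randn(2, 1000, 512)).shape)   # torch.Size([2, 64, 512])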
Molmo — fully open VLM (Sep 2024)
Allen AI released weights + data + code + evals. Architecture: per-patch CLIP → connector → LLM. Outputs include grounded points, e.g. <point x="63.5" y="44.5" alt="Mt Rainier">Mt Rainier</point>.
- PixMo dataset: 700K image-text pairs (vs LLaMA 3.1V’s 6B). Quality over quantity — collected by having annotators speak 60-90s descriptions then transcribing (“people don’t like to type… but they love to talk”).
- Results: Molmo 72B averages 81.2 on the academic benchmarks (beats Qwen2-VL 79.4, GPT-4o 78.5). Human-preference Elo: Molmo 72B = 1076 (2nd, just behind GPT-4o at 1079; beats Gemini 1.5 Pro and Claude 3.5 Sonnet). Molmo 7B-D at Elo 1056 beats Gemini 1.5 Flash, GPT-4V, and Qwen2-VL 72B.
- Pointing: outputs grounded coordinates, enabling chaining with SAM (e.g. point at cricket bat → SAM segments it). Demo: molmo.allenai.org.
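A small sketch of consuming that pointing output before handing it to SAM. The tag format is copied from the example above; the assumption that x and y are percentages of the image width and height is mine, so verify against the model card:

import re

POINT_RE = re.compile(r'<point x="([\d.]+)" y="([\d.]+)"[^>]*>')

def points_to_pixels(text, img_w, img_h):
    """Pull out <point> tags and map the 0-100 coordinates to pixel positions."""
    return [(float(x) / 100 * img_w, float(y) / 100 * img_h)
            for x, y in POINT_RE.findall(text)]

pts = points_to_pixels('<point x="63.5" y="44.5" alt="Mt Rainier">Mt Rainier</point>', 1280, 960)
print(pts)   # [(812.8, 427.2)]; these pixel coordinates can then be used as a SAM point prompt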
Source
CS231n 2025 Lec 16 slides ~111–131 (LLaVA recipe, Flamingo Perceiver Resampler + Gated XATTN-DENSE, Molmo + PixMo + Elo results).