Visual Instruction Tuning (LLaVA)

Project page:

Follow-up paper:

Walkthrough (CS231n 2025 Lec 16)

Architecture

Three blocks: CLIP ViT → linear projection → LLaMA. The visual side feeds patch tokens from CLIP's penultimate ViT layer (not the CLS token): the penultimate features still carry spatial structure and CLIP-aligned semantics, while CLS would compress the whole image into one vector.
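
A minimal PyTorch sketch of that bridge (module names and dimensions are illustrative, e.g. CLIP ViT-L/14 at 224px gives 256 patch tokens of dim 1024 and LLaMA-7B uses hidden dim 4096; this is not the reference implementation):

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    def __init__(self, clip_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # The stage-1 trainable piece: a single linear layer mapping CLIP
        # features into the LLM's token-embedding space.
        self.projection = nn.Linear(clip_dim, llm_dim)

    def forward(self, clip_hidden_states: torch.Tensor) -> torch.Tensor:
        # clip_hidden_states: penultimate-layer outputs, shape (B, 1 + num_patches, clip_dim)
        patch_tokens = clip_hidden_states[:, 1:, :]   # drop the CLS token, keep spatial tokens
        return self.projection(patch_tokens)          # (B, num_patches, llm_dim)

# The projected patch tokens are prepended to the text-token embeddings and fed
# to the LLaMA decoder as ordinary input embeddings.
dummy_clip_out = torch.randn(2, 1 + 256, 1024)                # batch of 2 images
visual_tokens = VisionLanguageConnector()(dummy_clip_out)     # (2, 256, 4096)
text_embeds = torch.randn(2, 32, 4096)                        # embeddings of the text prompt
llm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)   # (2, 288, 4096)
```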

Training recipe

  1. Initialize: pretrained LLaMA + pretrained CLIP, both frozen.
  2. Stage 1: train only the linear projection bridging CLIP features into the LLM token space. Cheap alignment step on image-caption data.
  3. Stage 2: finetune the projection and the LLM (and optionally the vision encoder) on >100K (image, instruction, GPT-4-generated response) tuples.

The trick is that cross-modal alignment is handled in stage 1 with nothing more than a linear layer, so stage 2 reduces to standard instruction tuning (sketched below).
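
A sketch of the two-stage freezing schedule (assumes the connector above plus hypothetical `vision_encoder` and `llm` modules; the parameter grouping is illustrative, not LLaVA's actual training code):

```python
def set_trainable(module, trainable: bool):
    for p in module.parameters():
        p.requires_grad = trainable

def configure_stage(stage: int, vision_encoder, projection, llm):
    # Both the vision encoder and the LLM start from pretrained checkpoints.
    if stage == 1:
        # Stage 1: only the linear projection learns; cheap alignment on image-caption pairs.
        set_trainable(vision_encoder, False)
        set_trainable(llm, False)
        set_trainable(projection, True)
    elif stage == 2:
        # Stage 2: unfreeze the LLM (the vision encoder usually stays frozen) and keep
        # training the projection on the instruction-following data.
        set_trainable(vision_encoder, False)
        set_trainable(llm, True)
        set_trainable(projection, True)
    return [p for m in (vision_encoder, projection, llm)
            for p in m.parameters() if p.requires_grad]

# trainable_params = configure_stage(1, vision_encoder, projection, llm)
# optimizer = torch.optim.AdamW(trainable_params, lr=1e-3)  # stage-dependent LR in practice
```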

Data

GPT-4 (text-only) is fed COCO captions plus bounding boxes and asked to write three flavors of instruction data: conversations, detailed descriptions, and complex reasoning. This yields >100K multimodal instruction samples without any human annotators in the loop.
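
An illustrative sketch of how an image gets turned into a text-only prompt for GPT-4 (field names and prompt wording are hypothetical; LLaVA's actual prompt templates differ):

```python
def build_symbolic_prompt(captions, boxes, task: str = "conversation") -> str:
    # Captions and bounding boxes stand in for the image, since GPT-4 here never sees pixels.
    caption_block = "\n".join(f"- {c}" for c in captions)
    box_block = "\n".join(
        f"- {label}: ({x1:.2f}, {y1:.2f}, {x2:.2f}, {y2:.2f})"
        for label, (x1, y1, x2, y2) in boxes
    )
    instructions = {
        "conversation": "Write a multi-turn Q&A conversation about this image.",
        "detailed_description": "Write a detailed description of this image.",
        "complex_reasoning": "Write a question requiring reasoning about this image, then answer it.",
    }
    return (
        f"Captions:\n{caption_block}\n\n"
        f"Objects (normalized x1, y1, x2, y2):\n{box_block}\n\n"
        f"{instructions[task]}"
    )

# Example with one COCO-style annotation:
prompt = build_symbolic_prompt(
    captions=["A dog catching a frisbee in a park."],
    boxes=[("dog", (0.21, 0.35, 0.58, 0.90)), ("frisbee", (0.55, 0.20, 0.70, 0.33))],
    task="complex_reasoning",
)
```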

Source

CS231n 2025 Lec 16 slides ~115–120 (LLaVA architecture, penultimate-layer rationale, two-stage training recipe, GPT-4-generated instruction data).