Visual Instruction Tuning (LLaVA)
Project page:

Follow-up paper:
Walkthrough (CS231n 2025 Lec 16)
Architecture
Three blocks: CLIP ViT → linear projection → LLaMA. The visual side feeds patch tokens from CLIP's penultimate ViT layer (not the CLS token): the penultimate features still carry spatial structure and CLIP-aligned semantics, while CLS would compress the whole image into one vector.
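A minimal PyTorch sketch of that visual path, assuming Hugging Face's `CLIPVisionModel` and a LLaMA-7B-sized hidden dimension of 4096; the class name, checkpoint, and dimension are illustrative, not LLaVA's exact code:

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, CLIPImageProcessor

class VisualProjector(nn.Module):
    """Sketch of LLaVA's visual side: frozen CLIP ViT -> trainable linear projection."""

    def __init__(self, clip_name="openai/clip-vit-large-patch14", llm_dim=4096):
        super().__init__()
        self.vision_tower = CLIPVisionModel.from_pretrained(clip_name)
        self.vision_tower.requires_grad_(False)  # CLIP stays frozen
        self.proj = nn.Linear(self.vision_tower.config.hidden_size, llm_dim)

    @torch.no_grad()
    def encode(self, pixel_values):
        out = self.vision_tower(pixel_values, output_hidden_states=True)
        feats = out.hidden_states[-2]   # penultimate ViT layer, shape [B, 1 + N_patches, D]
        return feats[:, 1:, :]          # drop the CLS token, keep the patch tokens

    def forward(self, pixel_values):
        return self.proj(self.encode(pixel_values))  # [B, N_patches, llm_dim] visual "tokens"

# usage sketch:
# processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
# pixel_values = processor(images=pil_image, return_tensors="pt").pixel_values
# visual_tokens = VisualProjector()(pixel_values)  # prepended to the LLM's text embeddings
```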
Training recipe
- Initialize: pretrained LLaMA + pretrained CLIP, both frozen.
- Stage 1: train only the linear projection bridging CLIP features into the LLM token space. Cheap alignment step on image-caption data.
- Stage 2: finetune the LLM (and optionally the vision encoder) on >100K (image, instruction, GPT-4-generated response) tuples.
The trick is that cross-modal alignment is handled in stage 1 by a single cheap linear layer, so stage 2 reduces to standard instruction tuning (a freezing sketch follows below).
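A sketch of the two freezing configurations, assuming a wrapper model with `vision_tower`, `proj`, and `llm` submodules; the attribute names and learning rates are assumptions, not the paper's code:

```python
def set_stage(model, stage: int):
    # Stage 1: only the projection trains (feature alignment on image-caption pairs).
    # Stage 2: projection + LLM train on the instruction data; the vision encoder
    #          stays frozen in the original recipe.
    model.vision_tower.requires_grad_(False)
    model.proj.requires_grad_(True)
    model.llm.requires_grad_(stage == 2)

def trainable_params(model):
    return [p for p in model.parameters() if p.requires_grad]

# usage sketch (learning rates are assumed, roughly matching the paper's regime):
# set_stage(model, stage=1)
# optimizer = torch.optim.AdamW(trainable_params(model), lr=2e-3)  # stage 2: ~2e-5
```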
Data
GPT-4 (text-only) is fed COCO captions + bounding boxes and asked to write three flavors of instruction data: conversations, detailed descriptions, and complex reasoning. This yields >100K multimodal instruction samples with no human annotators in the loop (see the prompting sketch below).
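A rough sketch of that prompting setup. The prompt wording and the `call_gpt4` helper are hypothetical; only the overall flow (render the image as text, request the three response types) follows the recipe above:

```python
def image_as_text(captions, boxes):
    """Render an image for a text-only model: captions plus 'label: [x1, y1, x2, y2]' lines."""
    lines = list(captions)
    lines += [f"{label}: {coords}" for label, coords in boxes]
    return "\n".join(lines)

PROMPTS = {
    "conversation": "Write a multi-turn Q&A between a person and an assistant about this image.",
    "detailed_description": "Describe the image in detail.",
    "complex_reasoning": "Ask and answer a question that requires reasoning about the image.",
}

def make_samples(captions, boxes, call_gpt4):
    # call_gpt4 is a placeholder for whatever client sends the prompt to the model.
    context = image_as_text(captions, boxes)
    return {kind: call_gpt4(f"{context}\n\n{instruction}")
            for kind, instruction in PROMPTS.items()}
```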
Source
CS231n 2025 Lec 16 slides ~115–120 (LLaVA architecture, penultimate-layer rationale, two-stage training recipe, GPT-4-generated instruction data).