Visual Instruction Tuning (LLaVA)

Project page:

Follow-up paper:

Walkthrough (CS231n 2025 Lec 16)

Architecture

Three blocks: CLIP ViT → linear projection → LLaMA. The visual side feeds patch tokens from CLIP's penultimate ViT layer (not the CLS token): the penultimate features still carry spatial structure and CLIP-aligned semantics, while CLS would compress the whole image into one vector.
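
A minimal PyTorch sketch of that bridge (module names and dimensions are illustrative, e.g. CLIP ViT-L/14 at 224px gives 256 patch tokens of dim 1024 and LLaMA-7B uses hidden dim 4096; this is not the reference implementation):

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    def __init__(self, clip_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # The stage-1 trainable piece: a single linear layer mapping CLIP
        # features into the LLM's token-embedding space.
        self.projection = nn.Linear(clip_dim, llm_dim)

    def forward(self, clip_hidden_states: torch.Tensor) -> torch.Tensor:
        # clip_hidden_states: penultimate-layer outputs, shape (B, 1 + num_patches, clip_dim)
        patch_tokens = clip_hidden_states[:, 1:, :]   # drop the CLS token, keep spatial tokens
        return self.projection(patch_tokens)          # (B, num_patches, llm_dim)

# The projected patch tokens are prepended to the text-token embeddings and fed
# to the LLaMA decoder as ordinary input embeddings.
dummy_clip_out = torch.randn(2, 1 + 256, 1024)                # batch of 2 images
visual_tokens = VisionLanguageConnector()(dummy_clip_out)     # (2, 256, 4096)
text_embeds = torch.randn(2, 32, 4096)                        # embeddings of the text prompt
llm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)   # (2, 288, 4096)
```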

Training recipe

  1. Initialize: pretrained LLaMA + pretrained CLIP, both frozen.
  2. Stage 1: train only the linear projection bridging CLIP features into the LLM token space. Cheap alignment step on image-caption data.
  3. Stage 2: finetune the projection and the LLM (and optionally the vision encoder) on >100K (image, instruction, GPT-4-generated response) tuples.

The trick is that cross-modal alignment is handled in stage 1 with nothing more than a linear layer, so stage 2 reduces to standard instruction tuning (sketched below).
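
A sketch of the two-stage freezing schedule (assumes the connector above plus hypothetical `vision_encoder` and `llm` modules; the parameter grouping is illustrative, not LLaVA's actual training code):

```python
def set_trainable(module, trainable: bool):
    for p in module.parameters():
        p.requires_grad = trainable

def configure_stage(stage: int, vision_encoder, projection, llm):
    # Both the vision encoder and the LLM start from pretrained checkpoints.
    if stage == 1:
        # Stage 1: only the linear projection learns; cheap alignment on image-caption pairs.
        set_trainable(vision_encoder, False)
        set_trainable(llm, False)
        set_trainable(projection, True)
    elif stage == 2:
        # Stage 2: unfreeze the LLM (the vision encoder usually stays frozen) and keep
        # training the projection on the instruction-following data.
        set_trainable(vision_encoder, False)
        set_trainable(llm, True)
        set_trainable(projection, True)
    return [p for m in (vision_encoder, projection, llm)
            for p in m.parameters() if p.requires_grad]

# trainable_params = configure_stage(1, vision_encoder, projection, llm)
# optimizer = torch.optim.AdamW(trainable_params, lr=1e-3)  # stage-dependent LR in practice
```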

Data

GPT-4 (text-only) is fed COCO captions plus bounding boxes and asked to write three flavors of instruction data: conversations, detailed descriptions, and complex reasoning. This yields >100K multimodal instruction samples without any human annotators in the loop.
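
An illustrative sketch of how an image gets turned into a text-only prompt for GPT-4 (field names and prompt wording are hypothetical; LLaVA's actual prompt templates differ):

```python
def build_symbolic_prompt(captions, boxes, task: str = "conversation") -> str:
    # Captions and bounding boxes stand in for the image, since GPT-4 here never sees pixels.
    caption_block = "\n".join(f"- {c}" for c in captions)
    box_block = "\n".join(
        f"- {label}: ({x1:.2f}, {y1:.2f}, {x2:.2f}, {y2:.2f})"
        for label, (x1, y1, x2, y2) in boxes
    )
    instructions = {
        "conversation": "Write a multi-turn Q&A conversation about this image.",
        "detailed_description": "Write a detailed description of this image.",
        "complex_reasoning": "Write a question requiring reasoning about this image, then answer it.",
    }
    return (
        f"Captions:\n{caption_block}\n\n"
        f"Objects (normalized x1, y1, x2, y2):\n{box_block}\n\n"
        f"{instructions[task]}"
    )

# Example with one COCO-style annotation:
prompt = build_symbolic_prompt(
    captions=["A dog catching a frisbee in a park."],
    boxes=[("dog", (0.21, 0.35, 0.58, 0.90)), ("frisbee", (0.55, 0.20, 0.70, 0.33))],
    task="complex_reasoning",
)
```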

Source

CS231n 2025 Lec 16 slides ~115–120 (LLaVA architecture, penultimate-layer rationale, two-stage training recipe, GPT-4-generated instruction data).