Foundation Model

A foundation model is a large model pretrained on broad data at scale that can be adapted (zero-shot, few-shot, or finetuned) to many downstream tasks. Scale varies wildly: GPT-4 is rumored (never confirmed) to have ~1.76T parameters, while CLIP's ViT-L image encoder has ~307M.

Why bundle these into one category?

Because the adaptation pattern is the same across modalities — pretrain once on the open web, then point the model at any task without retraining the backbone. Calling out the family makes it easy to compare what’s actually in the wild and where the gaps are (e.g. open-weight robotics foundation models lag closed VLMs by years).

Taxonomy (CS231n 2025 Lec 16)

Ranjay Krishna’s slide-111 grouping:

| Category | Models |
| --- | --- |
| Language | ELMo, BERT, [[notes/GPT]] |
| Classification | CLIP, CoCa |
| LM + Vision | LLaVA, Flamingo, GPT-4V, Gemini, Molmo |
| And More! | Segment Anything, Whisper, DALL-E, Stable Diffusion, Imagen |
| Chaining | LMs + CLIP (CuPL), Visual Programming (VisProg) |

The chaining row is the interesting one: instead of training a bigger end-to-end model, let an LLM orchestrate existing foundation models.

CuPL — “What does a platypus look like?”

Pratt et al. 2023. Rare and fine-grained classes (marimba, viaduct, papillon, lorikeet) underperform when CLIP sees only the bare class name. Pipeline (sketched in code after the steps):

  1. GPT-3 prompted “What does a {class} look like?” → “A lorikeet is a small to medium-sized parrot with brightly colored plumage.”
  2. Feed the description to CLIP for zero-shot classification.
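
A minimal sketch of this pipeline, assuming OpenAI's `clip` package; the hard-coded descriptions and `bird.jpg` are stand-ins for GPT-3 outputs and a real input image:

```python
# CuPL-style zero-shot classification: average CLIP text embeddings of
# LLM-generated class descriptions, then classify by cosine similarity.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

# In CuPL these come from GPT-3 ("What does a {class} look like?");
# hard-coded here for illustration.
descriptions = {
    "lorikeet": [
        "A lorikeet is a small to medium-sized parrot with brightly colored plumage.",
        "A photo of a lorikeet, a rainbow-colored parrot.",
    ],
    "marimba": [
        "A marimba is a large wooden percussion instrument played with mallets.",
        "A photo of a marimba with rows of wooden bars.",
    ],
}

class_names = list(descriptions)
with torch.no_grad():
    # One embedding per class: encode every description, normalize, average.
    per_class = []
    for name in class_names:
        emb = model.encode_text(clip.tokenize(descriptions[name]).to(device))
        emb = emb / emb.norm(dim=-1, keepdim=True)
        per_class.append(emb.mean(dim=0))
    text_feats = torch.stack(per_class)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

    image = preprocess(Image.open("bird.jpg")).unsqueeze(0).to(device)
    img_feats = model.encode_image(image)
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)

    probs = (100.0 * img_feats @ text_feats.T).softmax(dim=-1)

print(dict(zip(class_names, probs[0].tolist())))
```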

Results vs CLIP baseline: ImageNet 75.54→76.19 (+0.65), DTD 55.20→58.90 (+3.70), SUN397 +3.43, FGVC Aircraft +3.81. Also collapses the prompt set from ~80 hand-written templates to ~3.

VisProg — Visual Programming (Gupta et al. 2023)

Given a handful of in-context examples, GPT-3 writes a short program whose lines call existing vision models. Each module is backed by a real model:

  • Image understanding: Loc (OWL-ViT), FaceDet (DSFD), Seg (MaskFormer), Select / Classify (CLIP-ViT), Vqa (ViLT)
  • Image manipulation: Replace (Stable Diffusion), ColorPop, BgBlur, Tag, Emoji, Crop variants
  • Knowledge / arithmetic: List (GPT-3), Eval, Count, Result

Example program:

```
OBJ0=FaceDet(image=IMAGE)
LIST0=List(query='main characters on TV show Big Bang Theory', max=7)
OBJ1=Classify(image=IMAGE, object=OBJ0, categories=LIST0)
IMAGE0=Tag(image=IMAGE, object=OBJ1)
```
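
The execution pattern is easy to mock up. Below is a toy interpreter for this DSL, with stub modules standing in for the real models (every stub body and helper name here is made up for illustration; VisProg wires these to OWL-ViT, CLIP, GPT-3, etc.):

```python
import re

MODULES = {}

def module(fn):
    """Register a function as a callable program module."""
    MODULES[fn.__name__] = fn
    return fn

@module
def FaceDet(image):
    return [{"box": (10, 10, 50, 50)}]          # stub for DSFD face boxes

@module
def List(query, max):
    return ["Sheldon", "Leonard"][: int(max)]   # stub for a GPT-3 list query

@module
def Classify(image, object, categories):
    # Stub for CLIP: label each detected region with some category.
    return [dict(o, label=categories[i % len(categories)])
            for i, o in enumerate(object)]

@module
def Tag(image, object):
    return {"image": image, "tags": object}     # stub: draw labeled boxes

LINE = re.compile(r"(\w+)=(\w+)\((.*)\)")

def run(program, env):
    """Execute 'VAR=Module(k=v, ...)' lines top to bottom, threading env."""
    for line in program.strip().splitlines():
        target, name, argstr = LINE.match(line.strip()).groups()
        kwargs = {}
        for part in argstr.split(","):
            k, v = (s.strip() for s in part.split("=", 1))
            # Each argument is either a variable defined earlier or a literal.
            kwargs[k] = env[v] if v in env else eval(v)
        env[target] = MODULES[name](**kwargs)
    return env

env = run("""
OBJ0=FaceDet(image=IMAGE)
LIST0=List(query='main characters on TV show Big Bang Theory', max=7)
OBJ1=Classify(image=IMAGE, object=OBJ0, categories=LIST0)
IMAGE0=Tag(image=IMAGE, object=OBJ1)
""", {"IMAGE": "input.jpg"})
print(env["IMAGE0"])
```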

Solves NLVR (reasoning over image pairs), open-ended VQA, and language-driven image edits without training a new model.

Foundation Models in Robotics

What about latency?

Chomba Bupe's take (https://twitter.com/ChombaBupe/status/1626250221452754950) is not wrong if you apply it to self-driving cars, for example: a foundation model sitting in the control loop needs much lower inference latency before it's workable.
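
A back-of-envelope budget makes the gap concrete (every number below is an assumption for illustration, not a measurement):

```python
# Illustrative latency budget for a model inside a driving control loop.
control_rate_hz = 30                            # assumed perception loop rate
budget_ms = 1000 / control_rate_hz              # ~33 ms per control cycle

tokens_per_response = 20                        # assumed VLM output length
ms_per_token = 50                               # assumed autoregressive decode speed
model_ms = tokens_per_response * ms_per_token   # ~1000 ms per response

print(f"budget: {budget_ms:.0f} ms, model: {model_ms} ms, "
      f"gap: {model_ms / budget_ms:.0f}x too slow")
```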

Source

CS231n 2025 Lec 16 slides ~111, 132–145 (foundation-model taxonomy, CuPL, VisProg).