Foundation Model

A foundation model is a large model pretrained on broad data at scale that can be adapted (zero-shot, few-shot, or finetuned) to many downstream tasks. Scale varies wildly: GPT-4 is rumored (never confirmed) to have ~1.76T parameters, while CLIP's ViT-L image encoder has ~307M.

Why bundle these into one category?

Because the adaptation pattern is the same across modalities — pretrain once on the open web, then point the model at any task without retraining the backbone. Calling out the family makes it easy to compare what’s actually in the wild and where the gaps are (e.g. open-weight robotics foundation models lag closed VLMs by years).

Taxonomy (CS231n 2025 Lec 16)

Ranjay Krishna’s slide-111 grouping:

| Category | Models |
| --- | --- |
| Language | ELMo, BERT, [[notes/GPT]] |
| Classification | CLIP, CoCa |
| LM + Vision | LLaVA, Flamingo, GPT-4V, Gemini, Molmo |
| And More! | Segment Anything, Whisper, DALL-E, Stable Diffusion, Imagen |
| Chaining | LMs + CLIP (CuPL), Visual Programming (VisProg) |

The chaining row is the interesting one: instead of training a bigger end-to-end model, let an LLM orchestrate existing foundation models.

CuPL — “What does a platypus look like?”

Pratt et al. 2023. Rare and fine-grained classes (marimba, viaduct, papillon, lorikeet) underperform when CLIP sees only the bare class name. Pipeline (sketched in code after the steps):

  1. GPT-3 prompted “What does a {class} look like?” → “A lorikeet is a small to medium-sized parrot with brightly colored plumage.”
  2. Feed the description to CLIP for zero-shot classification.
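
A minimal sketch of this pipeline, assuming OpenAI's `clip` package; the hard-coded descriptions and `bird.jpg` are stand-ins for GPT-3 outputs and a real input image:

```python
# CuPL-style zero-shot classification: average CLIP text embeddings of
# LLM-generated class descriptions, then classify by cosine similarity.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

# In CuPL these come from GPT-3 ("What does a {class} look like?");
# hard-coded here for illustration.
descriptions = {
    "lorikeet": [
        "A lorikeet is a small to medium-sized parrot with brightly colored plumage.",
        "A photo of a lorikeet, a rainbow-colored parrot.",
    ],
    "marimba": [
        "A marimba is a large wooden percussion instrument played with mallets.",
        "A photo of a marimba with rows of wooden bars.",
    ],
}

class_names = list(descriptions)
with torch.no_grad():
    # One embedding per class: encode every description, normalize, average.
    per_class = []
    for name in class_names:
        emb = model.encode_text(clip.tokenize(descriptions[name]).to(device))
        emb = emb / emb.norm(dim=-1, keepdim=True)
        per_class.append(emb.mean(dim=0))
    text_feats = torch.stack(per_class)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

    image = preprocess(Image.open("bird.jpg")).unsqueeze(0).to(device)
    img_feats = model.encode_image(image)
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)

    probs = (100.0 * img_feats @ text_feats.T).softmax(dim=-1)

print(dict(zip(class_names, probs[0].tolist())))
```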

Results vs CLIP baseline: ImageNet 75.54→76.19 (+0.65), DTD 55.20→58.90 (+3.70), SUN397 +3.43, FGVC Aircraft +3.81. Also collapses the prompt set from ~80 hand-written templates to ~3.

VisProg — Visual Programming (Gupta et al. 2023)

Given a handful of in-context examples, GPT-3 writes a short program whose lines call existing vision models. Each module is backed by a real model:

  • Image understanding: Loc (OWL-ViT), FaceDet (DSFD), Seg (MaskFormer), Select / Classify (CLIP-ViT), Vqa (ViLT)
  • Image manipulation: Replace (Stable Diffusion), ColorPop, BgBlur, Tag, Emoji, Crop variants
  • Knowledge / arithmetic: List (GPT-3), Eval, Count, Result

Example program:

```
OBJ0=FaceDet(image=IMAGE)
LIST0=List(query='main characters on TV show Big Bang Theory', max=7)
OBJ1=Classify(image=IMAGE, object=OBJ0, categories=LIST0)
IMAGE0=Tag(image=IMAGE, object=OBJ1)
```
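
The execution pattern is easy to mock up. Below is a toy interpreter for this DSL, with stub modules standing in for the real models (every stub body and helper name here is made up for illustration; VisProg wires these to OWL-ViT, CLIP, GPT-3, etc.):

```python
import re

MODULES = {}

def module(fn):
    """Register a function as a callable program module."""
    MODULES[fn.__name__] = fn
    return fn

@module
def FaceDet(image):
    return [{"box": (10, 10, 50, 50)}]          # stub for DSFD face boxes

@module
def List(query, max):
    return ["Sheldon", "Leonard"][: int(max)]   # stub for a GPT-3 list query

@module
def Classify(image, object, categories):
    # Stub for CLIP: label each detected region with some category.
    return [dict(o, label=categories[i % len(categories)])
            for i, o in enumerate(object)]

@module
def Tag(image, object):
    return {"image": image, "tags": object}     # stub: draw labeled boxes

LINE = re.compile(r"(\w+)=(\w+)\((.*)\)")

def run(program, env):
    """Execute 'VAR=Module(k=v, ...)' lines top to bottom, threading env."""
    for line in program.strip().splitlines():
        target, name, argstr = LINE.match(line.strip()).groups()
        kwargs = {}
        for part in argstr.split(","):
            k, v = (s.strip() for s in part.split("=", 1))
            # Each argument is either a variable defined earlier or a literal.
            kwargs[k] = env[v] if v in env else eval(v)
        env[target] = MODULES[name](**kwargs)
    return env

env = run("""
OBJ0=FaceDet(image=IMAGE)
LIST0=List(query='main characters on TV show Big Bang Theory', max=7)
OBJ1=Classify(image=IMAGE, object=OBJ0, categories=LIST0)
IMAGE0=Tag(image=IMAGE, object=OBJ1)
""", {"IMAGE": "input.jpg"})
print(env["IMAGE0"])
```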

Solves NLVR (reasoning over image pairs), open-ended VQA, and language-driven image edits without training a new model.

Foundation Models in Robotics

What about latency?

Chomba Bupe's take (https://twitter.com/ChombaBupe/status/1626250221452754950) is not wrong if you apply it to self-driving cars, for example: a foundation model sitting in the control loop needs much lower inference latency before it's workable.
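
A back-of-envelope budget makes the gap concrete (every number below is an assumption for illustration, not a measurement):

```python
# Illustrative latency budget for a model inside a driving control loop.
control_rate_hz = 30                            # assumed perception loop rate
budget_ms = 1000 / control_rate_hz              # ~33 ms per control cycle

tokens_per_response = 20                        # assumed VLM output length
ms_per_token = 50                               # assumed autoregressive decode speed
model_ms = tokens_per_response * ms_per_token   # ~1000 ms per response

print(f"budget: {budget_ms:.0f} ms, model: {model_ms} ms, "
      f"gap: {model_ms / budget_ms:.0f}x too slow")
```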

Source

CS231n 2025 Lec 16 slides ~111, 132–145 (foundation-model taxonomy, CuPL, VisProg).