Vision-Language Model (VLM)
Resources
Models
- CLIP (contrastive only, not generative)
- PaliGemma
- LLaVA
- Google Gemini
- Prismatic VLM
In robotics, we fine-tune these VLMs to make VLAs (vision-language-action models).
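One common way to turn a VLM into a VLA (used by models like RT-2) is to discretize each continuous action dimension into a fixed number of bins, so that actions become ordinary tokens the language head can predict. A minimal sketch, with hypothetical function names and 256 bins assumed:

```python
import numpy as np

def discretize_action(action, low, high, n_bins=256):
    # Map each continuous action dimension into one of n_bins bins,
    # so the action can be emitted as ordinary discrete tokens.
    action = np.clip(action, low, high)
    return ((action - low) / (high - low) * (n_bins - 1)).round().astype(int)

def undiscretize_action(bins, low, high, n_bins=256):
    # Recover approximate continuous values from bin indices at execution time.
    return low + bins / (n_bins - 1) * (high - low)

# A 2-DoF action in [-1, 1] round-trips with at most half a bin of error.
a = np.array([0.1, -0.5])
tokens = discretize_action(a, -1.0, 1.0)
recovered = undiscretize_action(tokens, -1.0, 1.0)
```

The fine-tuned VLM then predicts these action tokens autoregressively, exactly as it would predict text tokens.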
There are a few setups:
- In captioning / VQA tasks, the VLM conditions on image tokens and then predicts text tokens autoregressively (e.g., “a dog chasing a ball”).
- In multimodal generation (e.g., image generation with text prompts), the VLM conditions on text tokens and predicts image tokens, which are then decoded back into pixels.
- In bidirectional encoders (e.g., CLIP, ALIGN), the VLM doesn’t “predict” tokens autoregressively at all, but instead aligns text tokens with visual embeddings in a joint latent space.
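The third setup (bidirectional encoders) can be sketched with a toy contrastive objective. This is a minimal sketch, not CLIP itself: the “encoders” here are just random linear projections standing in for the vision and text transformers, and the temperature value is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "encoders": random projections into a shared 64-d latent space.
# In a real model these would be a vision encoder and a text encoder.
W_img = rng.normal(size=(2048, 64))   # image features -> joint space
W_txt = rng.normal(size=(512, 64))    # text features  -> joint space

def embed(x, W):
    z = x @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)  # unit-normalize

batch = 8
img_feats = rng.normal(size=(batch, 2048))
txt_feats = rng.normal(size=(batch, 512))

z_img = embed(img_feats, W_img)
z_txt = embed(txt_feats, W_txt)
logits = z_img @ z_txt.T / 0.07  # cosine similarities / temperature

def cross_entropy(logits, targets):
    logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

# Symmetric InfoNCE loss: matched image-text pairs sit on the diagonal.
targets = np.arange(batch)
loss = (cross_entropy(logits, targets) + cross_entropy(logits.T, targets)) / 2
```

Training pushes matched pairs together and mismatched pairs apart, which is why such models retrieve and classify well but don’t generate tokens.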
What does a VLM predict? Like an LLM, it predicts tokens; the setups above differ only in which modalities it conditions on and which it predicts.
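Token prediction itself reduces to repeatedly scoring the vocabulary and appending the chosen token. A minimal greedy-decoding sketch with a hypothetical toy model (random embeddings and output head, not a real VLM):

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, d = 100, 32

# Stand-in for a trained model: a random embedding table and output head.
emb = rng.normal(size=(vocab_size, d))
head = rng.normal(size=(d, vocab_size))

def next_token_logits(context_ids):
    # "Encode" the context as the mean of its token embeddings,
    # then score every vocabulary entry.
    h = emb[context_ids].mean(axis=0)
    return h @ head

def generate(prompt_ids, n_new=5):
    ids = list(prompt_ids)
    for _ in range(n_new):
        logits = next_token_logits(ids)
        ids.append(int(np.argmax(logits)))  # greedy decoding
    return ids

out = generate([3, 17, 42], n_new=5)
```

In a captioning VLM the prompt would include image tokens; in a VLA the generated ids would be action tokens. The loop is the same either way.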