Vision-Language Model (VLM)
VLMs are a class of generative models that take image and text inputs and generate text outputs. They usually solve VQA or captioning-style tasks.
Resources
Models
- CLIP (but not generative, it’s contrastive)
- PaliGemma
- LLaVA
- Prismatic VLM
In robotics, we fine-tune these VLMs into vision-language-action models (VLAs).
There are a few setups:
- In captioning / VQA tasks, the VLM conditions on image tokens and then predicts text tokens autoregressively (e.g., “a dog chasing a ball”).
- In multimodal generation (e.g., image generation with text prompts), the VLM conditions on text tokens and predicts image tokens, which are then decoded back into pixels.
- In bidirectional encoders (e.g., CLIP, ALIGN), the VLM doesn’t “predict” tokens autoregressively at all; instead it aligns text tokens with visual embeddings in a joint latent space.
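A minimal PyTorch sketch of the contrastive setup (the function name, dimensions, and temperature are illustrative choices, not CLIP's released code). Both encoders map into a shared space, and a symmetric cross-entropy pulls matched image-text pairs together while pushing mismatched pairs apart:

import torch
import torch.nn.functional as F

def clip_style_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched (image, text) pairs.
    image_feats, text_feats: (B, D) embeddings; row i of each describes the same pair."""
    # normalize so the dot product is cosine similarity
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    # (B, B) similarity matrix; the diagonal holds the positive pairs
    logits = image_feats @ text_feats.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # classify "which text goes with this image" and the transposed direction
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

loss = clip_style_loss(torch.randn(8, 512), torch.randn(8, 512))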
What does a VLM predict?
LLMs predict tokens, and VLMs also predict tokens. If you’re thinking of models that generate images from text, those are text-to-image models, a separate category.
Walkthrough (CS231n 2025 Lec 16)
LLaVA — the simple recipe
Vision encoder (CLIP ViT, penultimate layer patch tokens — not CLS, since CLS throws away spatial info) → linear bridge → frozen/finetuned LLM (LLaMA). Recipe:
- Initialize with a pretrained LLM and pretrained CLIP.
- Freeze both, train only the linear bridge to align modalities (sketched after the recipe).
- Then finetune the LLM (and optionally the vision encoder) on >100K image+instruction+output samples.
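A minimal sketch of the bridge from that recipe (module name and dimensions are assumptions on my part, not LLaVA's actual code). One linear layer maps CLIP patch tokens into the LLM's embedding space; the projected image tokens are prepended to the text embeddings, and in stage 1 only this projection is trainable:

import torch
import torch.nn as nn

class LinearBridge(nn.Module):
    """Projects frozen CLIP patch tokens into the LLM's token-embedding space."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_tokens):        # (B, num_patches, vision_dim)
        return self.proj(patch_tokens)      # (B, num_patches, llm_dim)

# toy shapes: 2 images, 256 patch tokens each, 32 text tokens
bridge = LinearBridge()
patch_tokens = torch.randn(2, 256, 1024)    # would come from the frozen CLIP ViT (penultimate layer)
text_embeds = torch.randn(2, 32, 4096)      # would come from the frozen LLM's embedding table
image_embeds = bridge(patch_tokens)
llm_inputs = torch.cat([image_embeds, text_embeds], dim=1)   # image tokens first, then text
# stage 1: only bridge.proj gets gradients; the LLM and CLIP stay frozen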
Flamingo — gated cross-attention into a frozen LM
Frozen vision encoder + frozen LM, with two learned bits:
- Perceiver Resampler: down-samples a variable number of image/video tokens to a fixed number — lets the LM see arbitrary visual context length.
- GATED XATTN-DENSE layers, interleaved between the frozen LM blocks:
# gated cross-attention from text (y) into visual tokens (x), then a gated FFW
y = y + tanh(alpha_xattn) * attention(q=y, kv=x)
y = y + tanh(alpha_dense) * ffw(y)
# the original frozen LM block then runs unchanged
y = y + frozen_attention(q=y, kv=y)
y = y + frozen_ffw(y)
Both alpha_xattn and alpha_dense are initialized at 0, so at the start of training the gated layers are no-ops and the frozen LM’s behavior is preserved; training then gradually opens the gates. Training uses interleaved <image>...<eos> sequences with masked attention so each text token attends only to the most recent preceding image. This is what unlocks in-context few-shot learning for vision tasks.
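A PyTorch sketch of one GATED XATTN-DENSE layer matching the pseudocode above (dimensions and module layout are my own simplifications, not Flamingo's code). The point to notice is the zero-initialized gates, which make the layer a no-op around the frozen block at initialization:

import torch
import torch.nn as nn

class GatedXAttnDense(nn.Module):
    """Gated cross-attention + FFW inserted in front of a frozen LM block."""
    def __init__(self, dim=512, n_heads=8):
        super().__init__()
        self.xattn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffw = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # tanh(0) = 0, so both residual branches start as no-ops
        self.alpha_xattn = nn.Parameter(torch.zeros(1))
        self.alpha_dense = nn.Parameter(torch.zeros(1))

    def forward(self, y, x):
        # y: text tokens (B, T, dim); x: visual tokens from the Perceiver Resampler (B, V, dim)
        attn_out, _ = self.xattn(query=y, key=x, value=x, need_weights=False)
        y = y + torch.tanh(self.alpha_xattn) * attn_out
        y = y + torch.tanh(self.alpha_dense) * self.ffw(y)
        return y    # the frozen LM block (self-attention + FFW) then runs on y unchanged

# with the gates at zero the layer is an exact identity
layer = GatedXAttnDense()
y = torch.randn(2, 16, 512)     # text tokens
x = torch.randn(2, 64, 512)     # resampled visual tokens
assert torch.allclose(layer(y, x), y)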
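And a sketch of the Perceiver Resampler idea, again a simplification rather than the paper's exact architecture: a fixed set of learned latent queries cross-attends into however many visual tokens the encoder produced, so the output length is constant regardless of input length:

import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Compresses a variable number of visual tokens into a fixed num_latents outputs."""
    def __init__(self, dim=512, num_latents=64, n_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, visual_tokens):       # (B, N, dim); N varies with image count / video length
        q = self.latents.unsqueeze(0).expand(visual_tokens.size(0), -1, -1)
        out, _ = self.attn(query=q, key=visual_tokens, value=visual_tokens, need_weights=False)
        return out                           # (B, num_latents, dim), regardless of N

resampler = PerceiverResampler()
print(resampler(torch.randn(2, 1000, 512)).shape)   # torch.Size([2, 64, 512])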
Molmo — fully open VLM (Sep 2024)
Allen AI released weights + data + code + evals. Architecture: per-patch CLIP → connector → LLM. Outputs include grounded points, e.g. <point x="63.5" y="44.5" alt="Mt Rainier">Mt Rainier</point>.
- PixMo dataset: 700K image-text pairs (vs LLaMA 3.1V’s 6B). Quality over quantity — collected by having annotators speak 60-90s descriptions then transcribing (“people don’t like to type… but they love to talk”).
- Results: Molmo 72B averages 81.2 on the academic benchmarks (beats Qwen2-VL 79.4, GPT-4o 78.5). Human-preference Elo: Molmo 72B = 1076 (2nd, just behind GPT-4o at 1079; beats Gemini 1.5 Pro and Claude 3.5 Sonnet). Molmo 7B-D at Elo 1056 beats Gemini 1.5 Flash, GPT-4V, and Qwen2-VL 72B.
- Pointing: outputs grounded coordinates, enabling chaining with SAM (e.g. point at cricket bat → SAM segments it). Demo: molmo.allenai.org.
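A small sketch of consuming that pointing output before handing it to SAM. The tag format is copied from the example above; the assumption that x and y are percentages of the image width and height is mine, so verify against the model card:

import re

POINT_RE = re.compile(r'<point x="([\d.]+)" y="([\d.]+)"[^>]*>')

def points_to_pixels(text, img_w, img_h):
    """Pull out <point> tags and map the 0-100 coordinates to pixel positions."""
    return [(float(x) / 100 * img_w, float(y) / 100 * img_h)
            for x, y in POINT_RE.findall(text)]

pts = points_to_pixels('<point x="63.5" y="44.5" alt="Mt Rainier">Mt Rainier</point>', 1280, 960)
print(pts)   # [(812.8, 427.2)]; these pixel coordinates can then be used as a SAM point prompt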
Source
CS231n 2025 Lec 16 slides ~111–131 (LLaVA recipe, Flamingo Perceiver Resampler + Gated XATTN-DENSE, Molmo + PixMo + Elo results).