Vision-Language Model (VLM)


In robotics, we fine-tune these VLMs into vision-language-action models (VLAs).

There are a few setups:

  • In captioning / VQA tasks, the VLM conditions on image tokens and then predicts text tokens autoregressively (e.g., “a dog chasing a ball”).
  • In multimodal generation (e.g., image generation with text prompts), the VLM conditions on text tokens and predicts image tokens, which are then decoded back into pixels.
  • In bidirectional encoders (e.g., CLIP, ALIGN), the VLM doesn’t “predict” tokens autoregressively at all, but instead aligns text tokens with visual embeddings in a joint latent space.
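The third setup can be sketched with toy embeddings. This is a minimal, hedged illustration, not a real CLIP/ALIGN model: the random vectors below stand in for the outputs of an image encoder and a text encoder that project into the same joint latent space, and matching is just cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    """L2-normalize rows so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Pretend encoder outputs: 3 images and 3 captions in an 8-dim joint
# space (dimension is arbitrary). Each "caption" embedding is its
# image embedding plus a little noise, so pairs should align.
image_embeds = normalize(rng.standard_normal((3, 8)))
text_embeds = normalize(image_embeds + 0.05 * rng.standard_normal((3, 8)))

# Similarity matrix: entry [i, j] scores image i against caption j.
sim = image_embeds @ text_embeds.T

# Retrieval: each image should match its own (slightly noised) caption.
best = sim.argmax(axis=1)
print(best)  # → [0 1 2]
```

This is also why CLIP-style models are used for retrieval and zero-shot classification rather than caption generation: the output is a similarity score, not a token sequence.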

What does a VLM predict?

LLMs predict tokens. A VLM does the same, except that the sequence it predicts over can mix text tokens with image tokens, as in the setups above.
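The captioning setup (condition on image tokens, then predict text tokens one at a time) can be sketched as follows. Everything here is a hypothetical stand-in: the token ids, the vocabulary size, and the `next_token` function, which a real VLM would replace with a transformer forward pass over the whole sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 100                             # pretend text vocabulary size

def next_token(sequence):
    """Stand-in for a decoder: score every vocab entry, pick the best.
    Real logits would depend on `sequence`; these are random."""
    logits = rng.standard_normal(VOCAB)
    return int(logits.argmax())         # greedy decoding

image_tokens = [1001, 1002, 1003, 1004]  # placeholder patch-token ids
seq = list(image_tokens)                # condition on the image prefix
for _ in range(3):                      # then decode text autoregressively
    seq.append(next_token(seq))

print(len(seq))  # → 7  (4 image tokens + 3 predicted text tokens)
```

The key point is the sequence layout: the image tokens are a fixed prefix, and only text tokens are generated after it.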