PaliGemma
pi0 uses this model.
PaliGemma combines a pretrained SigLIP image encoder (a ViT) with a pretrained Gemma language model.
What does the “contrastive vision encoder” refer to?
- It’s just SigLIP
- SigLIP trains an image encoder and a text encoder jointly with a contrastive loss, so that matching image–text pairs end up close in embedding space (see the sketch below). PaliGemma keeps only the image encoder and discards SigLIP’s text encoder.
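A minimal sketch of the contrastive idea, in the spirit of SigLIP’s pairwise sigmoid loss. The function name, shapes, and hyperparameters are placeholders for illustration, not the actual SigLIP implementation:

```python
import torch
import torch.nn.functional as F

def siglip_style_loss(img_emb, txt_emb, temperature, bias):
    """Pairwise sigmoid contrastive loss (SigLIP-style sketch).

    img_emb, txt_emb: (batch, dim) L2-normalized embeddings from the
    image and text encoders; matching pairs share the same row index.
    """
    # Similarity of every image with every text in the batch.
    logits = temperature * img_emb @ txt_emb.t() + bias          # (batch, batch)
    # +1 on the diagonal (matched pairs), -1 everywhere else.
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1
    # Pull matched pairs together, push mismatched pairs apart.
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)

# Toy usage: embeddings would normally come from the ViT and text encoder.
img_emb = F.normalize(torch.randn(8, 512), dim=-1)
txt_emb = F.normalize(torch.randn(8, 512), dim=-1)
loss = siglip_style_loss(img_emb, txt_emb, temperature=10.0, bias=-10.0)
```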
Why doesn't PaliGemma also use SigLIP's text encoder?
Because Gemma was not trained to understand the text embeddings produced by SigLIP’s text encoder; it already has its own token embedding table for text. The only gap to close is on the image side: learning how to project the image embeddings into the embedding space of the LLM (see the sketch below).
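A minimal sketch of that projection, assuming placeholder dimensions and module names (not the actual PaliGemma code): the ViT’s patch embeddings are mapped into the LLM’s embedding space and concatenated with the text token embeddings before going into Gemma.

```python
import torch
import torch.nn as nn

class ImageToLLMProjector(nn.Module):
    """Sketch: project SigLIP patch embeddings into the LLM's embedding space.

    Dimensions are illustrative placeholders, not PaliGemma's real sizes.
    """
    def __init__(self, vision_dim=1152, llm_dim=2048):
        super().__init__()
        # A single linear layer bridges the two embedding spaces.
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_emb):          # (batch, num_patches, vision_dim)
        return self.proj(patch_emb)        # (batch, num_patches, llm_dim)

# Toy usage: projected image tokens are prepended to the text token
# embeddings, and the combined sequence is fed to the LLM.
batch, num_patches, seq_len = 2, 256, 16
patch_emb = torch.randn(batch, num_patches, 1152)   # from the SigLIP ViT
text_emb = torch.randn(batch, seq_len, 2048)        # from Gemma's embedding table

projector = ImageToLLMProjector()
image_tokens = projector(patch_emb)
llm_input = torch.cat([image_tokens, text_emb], dim=1)  # (batch, 256+16, 2048)
```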