🛠️ Steven Gong

Aug 04, 2025, 1 min read

PaliGemma

By Google DeepMind

Resources

https://arxiv.org/pdf/2407.07726

pi0 uses this model as its vision-language backbone.

What does the “contrastive vision encoder” refer to?

It’s just SigLIP, a vision encoder pretrained contrastively on image-text pairs with a sigmoid loss.

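The image tokens produced by the encoder are linearly projected into the same embedding space as Gemma’s text tokens, so the language model reads image and text tokens as one sequence.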
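
To make the mapping concrete, here is a minimal sketch of the idea in plain Python: patch embeddings from the vision encoder go through a linear projection so they have the same width as the text token embeddings, and are then placed in front of the text tokens. The sizes, the random weights, and the helper name `linear_project` are all toy assumptions for illustration, not PaliGemma's actual code or dimensions.

```python
import random


def linear_project(tokens, weight, bias):
    """Apply y = x @ W + b to every token.

    tokens: list of vectors of length d_in
    weight: d_in x d_out matrix (list of rows)
    bias:   vector of length d_out
    """
    columns = list(zip(*weight))  # each column has length d_in
    return [
        [sum(x * w for x, w in zip(tok, col)) + b for col, b in zip(columns, bias)]
        for tok in tokens
    ]


random.seed(0)

# Toy sizes (assumptions): 4 image patches, vision width 3, text width 5, 2 text tokens.
NUM_PATCHES, D_VISION, D_TEXT, NUM_TEXT = 4, 3, 5, 2


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


image_tokens = rand_matrix(NUM_PATCHES, D_VISION)  # stand-in for vision encoder output
text_tokens = rand_matrix(NUM_TEXT, D_TEXT)        # stand-in for text token embeddings
weight = rand_matrix(D_VISION, D_TEXT)             # projection weights (learned in practice)
bias = [0.0] * D_TEXT

# After projection, image tokens have the same width as text tokens,
# so both can be concatenated into one input sequence for the language model.
projected = linear_project(image_tokens, weight, bias)
sequence = projected + text_tokens

print(len(sequence), len(sequence[0]))
```

The point of the sketch is only the shape bookkeeping: once every image token has the text embedding width, the language model does not need a separate input path for images.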

Also see the pi0 note for how I explain this.
