VLM, Gemma

PaliGemma

By Google DeepMind

Resources

pi0 uses this model

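What does the “contrastive vision encoder” refer to? It is SigLIP: an image encoder pretrained contrastively on image-text pairs, so that an image and its matching caption end up with similar embeddings.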

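The image patches are encoded by the vision encoder and then linearly projected into the same embedding space as the text tokens, so the language model reads image and text tokens as one sequence.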

  • Also see pi0 for how I explain this
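
To make the "same embedding space" idea concrete, here is a minimal sketch in plain Python. It assumes the mapping is a single linear projection followed by concatenation with the text token embeddings. All sizes, the random weights, and the names (`patch_embeddings`, `projection`, `embedding_table`, `prompt_ids`) are made up for illustration and are not PaliGemma's real dimensions or API.

```python
import random

# Toy sizes, chosen for illustration only (not PaliGemma's real dimensions).
NUM_PATCHES = 4   # image patches produced by the vision encoder
VISION_DIM = 6    # width of each patch embedding
TEXT_DIM = 8      # width of the language model's token embeddings
VOCAB_SIZE = 10   # size of the toy vocabulary

rng = random.Random(0)


def random_matrix(rows, cols):
    """Return a rows x cols matrix of Gaussian noise (stand-in for learned weights)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Stand-in for the vision encoder output: one vector per image patch.
patch_embeddings = random_matrix(NUM_PATCHES, VISION_DIM)

# Linear projection from the vision width to the text embedding width.
projection = random_matrix(VISION_DIM, TEXT_DIM)
image_tokens = matmul(patch_embeddings, projection)

# Text tokens are looked up in the language model's embedding table.
embedding_table = random_matrix(VOCAB_SIZE, TEXT_DIM)
prompt_ids = [3, 1, 7]  # hypothetical token ids for a short prompt
text_tokens = [embedding_table[i] for i in prompt_ids]

# Both kinds of token now have the same width, so they form one sequence.
sequence = image_tokens + text_tokens
print(len(sequence), len(sequence[0]))  # prints "7 8"
```

The point of the sketch is only the shapes: after the projection, an image token is indistinguishable in format from a text token, which is why the language model can attend over both.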