Vision-Language-Action Model (VLA)

A VLA (a.k.a. Large Behavior Model, LBM) is a robot foundation model that maps (observation, goal) → action with no explicit state representation or transition function; the model itself is the policy.
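To make that contract concrete, a minimal sketch of the interface in Python. Everything here is illustrative, not any published model's API: `VLAPolicy`, `act`, and the 7-dim action are assumptions, and the stub body stands in for a VLM backbone that would actually decode actions from pixels + text.

```python
import numpy as np


class VLAPolicy:
    """Minimal sketch of the VLA contract: (observation, goal) -> action.

    No state estimator, no transition model: one network maps raw
    pixels plus a language instruction directly to motor commands.
    All names are hypothetical, not a real library's API.
    """

    def __init__(self, action_dim: int = 7):
        # e.g. 6-DoF end-effector delta + 1 gripper command (an assumption)
        self.action_dim = action_dim

    def act(self, rgb: np.ndarray, instruction: str) -> np.ndarray:
        # A real VLA would run a VLM backbone here (image + text in,
        # discretized or diffusion-decoded actions out). This stub
        # just returns a zero action of the right shape.
        assert rgb.ndim == 3 and rgb.shape[-1] == 3, "expect an (H, W, 3) image"
        return np.zeros(self.action_dim, dtype=np.float32)


# Closed-loop use: re-query the policy with a fresh observation each timestep.
policy = VLAPolicy()
frame = np.zeros((224, 224, 3), dtype=np.uint8)  # placeholder camera frame
action = policy.act(frame, "pick up the red block")
print(action.shape)  # (7,)
```

Note the design point the definition is making: the goal arrives as free-form language and the observation as raw pixels, so there is no hand-built state to maintain between calls.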

Why call them VLAs vs LBMs?

Same object, different communities. “VLA” emphasizes the heritage from VLMs (a VLM whose output head emits actions instead of text); “LBM” emphasizes the analogy to LLMs (a large model trained on behavior data). CS231n 2025 Lec 17 uses both names interchangeably.

The honest framing from CS231n 2025 Lec 17 (slide 82): current VLMs aren’t always perfect, but they always produce something reasonable. By analogy, a robot foundation model’s synthesized action won’t always be optimal, but the trajectory should always be beautiful and reasonable. That’s the bar.

See Robot Foundation Models for the model lineup (RT-1 → RT-2 → RT-X → OpenVLA → π₀ → Helix / Hi-Robot / Gemini Robotics / GR00T / DYNA-1) and Robot Learning for the full Lec 17 walkthrough.