π0: A Vision-Language-Action Flow Model for General Robot Control

Successor to the Octo model.

Links:

https://www.physicalintelligence.company/blog/pi0
https://www.youtube.com/live/ELUMFpJCUS0?t=16866s Kevin Black motivating the architectural design (coming from Octo), starting at 04:41:06
Me trying to explain how the architecture works: https://www.youtube.com/watch?v=NVS-7VJMA5c

Two main contributions:

1. Applying VLM to VLA via flow matching
   - Leverages large-scale internet pretraining from VLM
2. Data recipe

“averaging over 10 trials per task”

This is how many trials they run per task to compute the success rate.

Model Architecture

It’s a MoE-style architecture composed of 2 experts:

Expert 1 (VLM): PaliGemma (3 billion params), pre-trained on internet-scale data
Expert 2 (large transformer): Gemma action expert (300 million params), trained from scratch
- The Gemma expert is a dedicated transformer stack for actions

The two experts talk to each other via blockwise causal attention.

This is a really important detail that is not shown in the paper

There are 18 transformer blocks (i.e. depth in the paper)
Width = the model’s hidden size (a.k.a. embedding dim, model dim).
- For the VLM expert (PaliGemma backbone): width = 2048
- For the Action expert: width = 1024
Note that the width doesn’t have to match between the two experts; only the attention head dim needs to be the same.
- Both experts map into the same attention head space ( $8$ heads $\times 256$ head dim = $2048$ ).
- Concatenate tokens → do one global attention.
- Split results → project back into each expert’s private width (2048 vs 1024).
- Enables cross-attention between VLM and Action tokens, while keeping parameters separate.
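The steps in the list above can be sketched in NumPy. This is a toy illustration under stated assumptions, not the actual π0 code: the sizes are tiny, the names `make_expert` and `joint_attention` are made up, plain multi-head attention is used, and the blockwise mask has only two blocks (VLM tokens, then action tokens).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def make_expert(rng, width, n_heads, head_dim):
    """Private projection weights for one expert (hypothetical layout)."""
    d = n_heads * head_dim
    return {
        "wq": rng.normal(0, width ** -0.5, (width, d)),
        "wk": rng.normal(0, width ** -0.5, (width, d)),
        "wv": rng.normal(0, width ** -0.5, (width, d)),
        "wo": rng.normal(0, d ** -0.5, (d, width)),
    }

def joint_attention(x_vlm, x_act, p_vlm, p_act, n_heads, head_dim):
    n_vlm = x_vlm.shape[0]

    def project(x, p, name):
        # Each expert uses its own weights but lands in the same head space.
        return (x @ p[name]).reshape(x.shape[0], n_heads, head_dim)

    # Concatenate the tokens of both experts along the sequence axis.
    q, k, v = (
        np.concatenate([project(x_vlm, p_vlm, n), project(x_act, p_act, n)])
        for n in ("wq", "wk", "wv")
    )

    # Blockwise mask: a token may attend to its own block and earlier blocks.
    block = np.array([0] * n_vlm + [1] * x_act.shape[0])
    mask = block[:, None] >= block[None, :]

    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(head_dim)
    scores = np.where(mask[None], scores, -np.inf)
    out = np.einsum("hqk,khd->qhd", softmax(scores), v)
    out = out.reshape(out.shape[0], n_heads * head_dim)

    # Split the result and project back into each expert's private width.
    return out[:n_vlm] @ p_vlm["wo"], out[n_vlm:] @ p_act["wo"]

rng = np.random.default_rng(0)
n_heads, head_dim = 2, 8                          # toy sizes, not the real model's
p_vlm = make_expert(rng, 32, n_heads, head_dim)   # stand-in "VLM" width 32
p_act = make_expert(rng, 16, n_heads, head_dim)   # stand-in "action" width 16
x_vlm, x_act = rng.normal(size=(5, 32)), rng.normal(size=(3, 16))
y_vlm, y_act = joint_attention(x_vlm, x_act, p_vlm, p_act, n_heads, head_dim)
print(y_vlm.shape, y_act.shape)
```

With this mask the VLM outputs do not depend on the action tokens, while the action outputs see everything, so the pre-trained side is not disturbed by the new tokens.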

Why use a separate expert as opposed to just introducing new tokens to the VLM?

That makes it one giant decoder, which is what Kevin Black originally tried for the pi0 model. However, convergence is really slow, and there is a distribution shift that makes the pre-trained model super confused (explained at ~5:01:31 of the talk).

Idea: use a whole set of different weights for the action expert, which is trained from scratch.

Another thing Kevin pointed out: leveraging pre-training is super-duper important!!

Flow Matching

At training time, the Flow Matching loss to train the policy is given by $L_{τ} (θ) = E_{p (A_{t} ∣ o_{t}), q (A_{t}^{τ} ∣ A_{t})} ∥ v_{θ} (A_{t}^{τ}, o_{t}) - u (A_{t}^{τ} ∣ A_{t}) ∥^{2}$

Where

$u (A_{t}^{τ} ∣ A_{t}) = ϵ - A_{t}$ (Note: differentiating the interpolation on the next line with respect to $τ$ gives $A_{t} - ϵ$ , so the sign here looks flipped.)
$A_{t}^{τ} = τ A_{t} + (1 - τ) ϵ$
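A minimal sketch of how one Monte Carlo estimate of this loss could be computed, with several assumptions: `v_theta` is a hypothetical callable that also receives $τ$, $τ$ is sampled uniformly for simplicity, and the target uses the sign $A_{t} - ϵ$ (the derivative of the interpolation with respect to $τ$) rather than the sign written above.

```python
import numpy as np

def flow_matching_loss(v_theta, actions, obs, rng):
    """Monte Carlo estimate of the flow matching loss for a batch.

    actions: array of shape (batch, horizon, action_dim).
    v_theta: callable (a_tau, obs, tau) -> predicted velocity (hypothetical API).
    """
    eps = rng.normal(size=actions.shape)               # eps ~ N(0, I)
    tau = rng.uniform(size=(actions.shape[0], 1, 1))   # assumption: uniform tau
    a_tau = tau * actions + (1.0 - tau) * eps          # noisy action A_t^tau
    target = actions - eps                             # d/dtau of a_tau
    pred = v_theta(a_tau, obs, tau)
    # Mean instead of a sum over dimensions; this only rescales the loss.
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
actions = np.zeros((4096, 8, 4))                       # dummy all-zero action chunks
zero_model = lambda a_tau, obs, tau: np.zeros_like(a_tau)
loss = flow_matching_loss(zero_model, actions, obs=None, rng=rng)
print(loss)  # close to 1.0 here, since the target is just -eps
```

Note that the target does not depend on $τ$ at all; only the input $A_{t}^{τ}$ does.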

Potential source of confusion

Notice that $v_{θ} (A_{t}^{τ}, o_{t})$ is always trained to predict the full difference between $A_{t}$ and $ϵ$ (written $ϵ - A_{t}$ above), no matter which $τ$ was sampled, even though it is conditioned on $A_{t}^{τ}$ . You might think it should really be learning $τ (ϵ - A_{t})$ , but that would be a distance (displacement) vector. We are trying to learn the velocity field, i.e. the first derivative of the path with respect to $τ$ , which stays constant along a straight line.

The scaling happens at inference time instead, where each predicted velocity is multiplied by the step size $δ$ .

What's the point of $\tau$ ?

Without $τ$ , where you just start from $ϵ$ and directly predict $A_{t}$ , that’s essentially a denoising autoencoder view.

$τ$ allows you to better capture multi-modal behavior; otherwise you end up with mode collapse. Think about this scenario:
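A toy 1D illustration of the mode-collapse concern (my own example, not from the paper): suppose the data has two equally good actions, $-1$ and $+1$, and we regress the action directly from independent noise with a squared-error loss. The linear model below is an assumption made to keep the example short; any squared-error regressor is pulled toward the same average.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
actions = rng.choice([-1.0, 1.0], size=n)   # two equally likely modes
eps = rng.normal(size=n)                    # noise, independent of the action

# Least-squares fit of action ≈ w * eps + b (one-shot "noise -> action" regression).
X = np.stack([eps, np.ones(n)], axis=1)
(w, b), *_ = np.linalg.lstsq(X, actions, rcond=None)
pred = w * eps + b

print(w, b)                    # both near 0
print(np.abs(pred).mean())     # predictions sit near 0, far from either mode
```

The noise says nothing about which mode was sampled, so the fit lands near the mean action, which is neither of the two valid actions.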

Background on flow matching

ChatGPT conversation: https://chatgpt.com/share/68c5b5fa-7db4-8002-9708-ef4e953533f9

The idea: we want to turn noise $ϵ \sim N (0, I)$ into a data sample (here, an action $A_{t}$ ). We can describe this transformation as a continuous process over time $τ \in [0, 1]$ , i.e. an ODE:

$\frac{d}{d τ} X^{τ} = v (X^{τ}, τ),$

with initial condition $X^{0} = ϵ$ and at the end, $X^{1} \approx A_{t}$ . Flow matching tries to learn this vector field $v (\cdot)$ .

To train such a model, we choose a reference path between $ϵ$ and $A_{t}$ and differentiate along it. The simplest path between $ϵ$ and $A_{t}$ is a straight line (i.e. linear interpolation): $A_{t}^{τ} = (1 - τ) ϵ + τ A_{t}, τ \in [0, 1]$

At $τ = 0$ , you are at pure noise, and at $τ = 1$ , you’re at the action

Taking the derivative with respect to $τ$ , we see that: $\frac{d}{d τ} A_{t}^{τ} = \frac{d}{d τ} ((1 - τ) ϵ + τ A_{t}) = A_{t} - ϵ$

The derivative of a straight line is a constant slope, so we just need to learn this constant!

We train $v_{θ}$ to predict this derivative (a constant along each path), so that at inference time the model can supply it for any arbitrary $A_{t}^{τ}$ : $v_{θ} (A_{t}^{τ}, o_{t}) \approx \frac{d}{d τ} A_{t}^{τ} = A_{t} - ϵ$

At inference time, we start with random noise $A_{t}^{0} \sim N (0, I)$ and integrate the learned vector field from $τ = 0$ to $τ = 1$ using the forward Euler integration rule: $A_{t}^{τ + δ} = A_{t}^{τ} + δ v_{θ} (A_{t}^{τ}, o_{t})$

where $δ$ is the integration step size ( $δ = 0.1$ in the paper)
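The Euler rule above can be sketched as a short sampling loop. This is a sketch under assumptions: `v_theta` is a hypothetical callable that also takes $τ$, and the toy field at the bottom is hand-written so that its flow ends at one known action `c`; it is not a trained model.

```python
import numpy as np

def sample_actions(v_theta, obs, shape, rng, n_steps=10):
    """Forward Euler integration of the velocity field from tau=0 to tau=1."""
    delta = 1.0 / n_steps                  # step size, 0.1 for 10 steps
    a = rng.normal(size=shape)             # A_t^0 ~ N(0, I)
    for k in range(n_steps):
        tau = k * delta
        a = a + delta * v_theta(a, obs, tau)
    return a

# Toy field (assumption): the exact velocity when the data is a single action c.
c = np.array([0.3, -0.7])
toy_field = lambda a, obs, tau: (c - a) / (1.0 - tau)

sample = sample_actions(toy_field, None, (2,), np.random.default_rng(0))
print(sample)  # ends at c, up to floating-point error
```

For this toy field the last step at $τ = 0.9$ has $δ / (1 - τ) = 1$ , so it jumps exactly onto `c` regardless of the starting noise.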

Why is 10 steps better than 1 step?

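Because at the end of the day, we are trying to learn multi-modal distributions. With a single step from $τ = 0$ , the input is pure noise and carries no information about which mode to pick, so the best velocity under a squared-error loss points toward the average action.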

ChatGPT answer: If you take 10 smaller steps, each step only needs to be locally correct. Integration keeps pulling you back onto the line, so error doesn’t explode; it averages out.
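A toy 1D check of the 1-step vs 10-step question (my own example, not from the paper). Assume the data is $\pm 1$ with equal probability. For the straight-line path, the velocity that minimizes the squared-error loss then has a closed form, $(\tanh(τ x / (1 - τ)^{2}) - x) / (1 - τ)$ , so no training is needed to compare the two samplers.

```python
import numpy as np

def toy_velocity(x, tau):
    """Loss-minimizing velocity for 1D data a in {-1, +1} with equal weight."""
    s = 1.0 - tau
    posterior_mean = np.tanh(tau * x / s**2)   # E[a | x] on the straight-line path
    return (posterior_mean - x) / s

def euler_sample(x0, n_steps):
    delta = 1.0 / n_steps
    x = x0.copy()
    for k in range(n_steps):
        x = x + delta * toy_velocity(x, k * delta)
    return x

x0 = np.random.default_rng(0).normal(size=1000)
one_step = euler_sample(x0, 1)
ten_steps = euler_sample(x0, 10)

print(np.abs(one_step).max())                               # every sample collapses to the mean, 0
print(np.mean(np.abs(np.abs(ten_steps) - 1.0) < 0.05))      # fraction that lands near -1 or +1
```

With one step, the velocity at $τ = 0$ is $-x$ , so every sample lands on the mean $0$ . With 10 steps, later evaluations see a partially denoised input and can commit to one of the two modes.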

🛠️ Steven Gong
