Robot Foundation Models

π0: A Vision-Language-Action Flow Model for General Robot Control

Links:

The images and proprioceptive state are encoded via corresponding encoders and then projected via a linear projection layer into the same embedding space as the language tokens.

Model Architecture

"averaging over 10 trials per task"

  • This is how many trials they run per task to compute the reported success rates

Why flow-matching?

  • To ensure/constrain smooth, continuous robot action outputs, as opposed to random jumps in values

VLM Logic

LLMs predict sequences of tokens. What does a VLM predict? Still text tokens — the images are encoded and projected into the same embedding space as the language tokens, so only the input side is multimodal. π0 then adds an action expert on top that outputs continuous actions instead of text.

Flow Matching Math

At training time, the Flow Matching loss to train the policy is given by

$$L^\tau(\theta) = \mathbb{E}_{p(A_t \mid o_t),\, q(A_t^\tau \mid A_t)}\left[\big\| v_\theta(A_t^\tau, o_t) - u(A_t^\tau \mid A_t) \big\|^2\right]$$

where

  • $A_t$ is the action chunk, $o_t$ the observation, and $A_t^\tau = \tau A_t + (1 - \tau)\epsilon$ is the noisy action chunk, with noise $\epsilon \sim \mathcal{N}(0, I)$ and flow-matching time $\tau \in [0, 1]$
  • $u(A_t^\tau \mid A_t) = \epsilon - A_t$ is the target vector field as written in the paper
  • (Note: this should be $A_t - \epsilon$? Taking the derivative of $A_t^\tau$ with respect to $\tau$ gives $A_t - \epsilon$, which is the direction needed when integrating from noise at $\tau = 0$ to data at $\tau = 1$.)
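As a sanity check, the loss can be sketched in a few lines of NumPy. Everything here is illustrative: `v_theta` stands in for the real VLM + action expert, $\tau$ is drawn uniformly even though the paper uses a beta-shaped schedule emphasizing noisier timesteps, and the target follows the $A_t - \epsilon$ convention.

```python
import numpy as np

def flow_matching_loss(v_theta, actions, obs, rng):
    """Flow-matching loss for one action chunk `actions` of shape (H, action_dim).

    Interpolant: A^tau = tau * A + (1 - tau) * eps with eps ~ N(0, I), so the
    straight-line (conditional) velocity dA^tau/dtau = A - eps is constant in tau.
    """
    eps = rng.standard_normal(actions.shape)      # sampled noise
    tau = rng.uniform()                           # uniform here; pi0 uses a beta-shaped schedule
    a_tau = tau * actions + (1.0 - tau) * eps     # noisy action chunk A_t^tau
    target = actions - eps                        # target velocity (noise-to-data displacement)
    pred = v_theta(a_tau, obs, tau)               # network's predicted velocity
    return float(np.mean((pred - target) ** 2))
```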

Potential source of confusion

Notice that $v_\theta$ is always learning to predict the full noise-to-data displacement $A_t - \epsilon$, even though it is conditioned on the intermediate point $A_t^\tau$. You might think it should really be learning the remaining displacement $A_t - A_t^\tau = (1 - \tau)(A_t - \epsilon)$, but you'd be mistaken: the network predicts the (constant) velocity of the straight-line path, not the remaining distance.

  • The multiplication by the step size $\delta$ is done at inference time to control how far each integration step moves along this velocity

What's the point of $\tau$?

Without $\tau$, where you just start from noise $\epsilon$ and directly predict $A_t$ in one shot, that's essentially a denoising autoencoder view.

$\tau$ allows you to better capture multi-modal behavior; without it you end up with mode collapse. Think about this scenario: a task can be completed two valid ways (say, reaching around an obstacle to the left or to the right). A one-shot regression from noise averages the two valid actions into a single action that accomplishes neither.

Flow matching trains on all intermediate points. Taking the derivative of $A_t^\tau = \tau A_t + (1 - \tau)\epsilon$ with respect to $\tau$, we see that:

$$\frac{d A_t^\tau}{d\tau} = A_t - \epsilon$$

  • The derivative of a straight line is a constant slope

We learn to predict this gradient (a constant), so that at inference time the learned mapping holds at any arbitrary $\tau$ along the path.

At inference time, they start with random noise $A_t^0 \sim \mathcal{N}(0, I)$ and integrate the learned vector field from $\tau = 0$ to $\tau = 1$, using the forward Euler integration rule:

$$A_t^{\tau + \delta} = A_t^\tau + \delta\, v_\theta(A_t^\tau, o_t)$$

  • where $\delta$ is the integration step size ($\delta = 0.1$ in the paper, i.e. 10 integration steps)
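The integration loop can be sketched as follows; `v_theta` is a stand-in callable for the trained model, not the paper's actual interface.

```python
import numpy as np

def sample_action_chunk(v_theta, obs, horizon, action_dim, delta=0.1, seed=None):
    """Sample one action chunk by forward-Euler integration of the learned field,
    starting from pure noise at tau = 0 (delta = 0.1 gives the paper's 10 steps)."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((horizon, action_dim))    # A_t^0 ~ N(0, I)
    n_steps = int(round(1.0 / delta))
    for k in range(n_steps):
        tau = k * delta
        a = a + delta * v_theta(a, obs, tau)          # A^{tau+delta} = A^tau + delta * v_theta
    return a
```

As a quick check of the loop: if `v_theta` returned the remaining displacement divided by the remaining time, $(A - a)/(1 - \tau)$ for a fixed target $A$, the final step would land exactly on $A$.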

Why is 10 steps better than 1 step?

Because at the end of the day, the target for each training pair is a straight line (a constant vector), but the field the network actually learns is an average over many such lines, and it is only approximately correct. Smaller steps let the integration correct itself along the way.

  • ChatGPT answer: if you take 10 smaller steps, each step only needs to be locally correct. Integration keeps pulling you back onto the line, so the error doesn't explode; it averages out.
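Both the mode-collapse story and the step-count question show up in a toy 1-D experiment (entirely illustrative, not from the paper): the "action" is a 50/50 mixture of two values at ±1, and we integrate the *exact* marginal velocity $\mathbb{E}[A - \epsilon \mid A^\tau = x]$, so any remaining error is purely Euler discretization.

```python
import numpy as np

def marginal_velocity(x, tau, modes=(-1.0, 1.0)):
    """Exact marginal velocity E[A - eps | A^tau = x] when the 'action' A is a
    50/50 mixture of point masses at `modes` and A^tau = tau*A + (1 - tau)*eps."""
    s = 1.0 - tau
    x = np.asarray(x, dtype=float)
    # Posterior probability of each mode given the noisy point x.
    logits = np.stack([-(x - tau * a) ** 2 / (2.0 * s ** 2) for a in modes])
    logits -= logits.max(axis=0, keepdims=True)
    w = np.exp(logits)
    w /= w.sum(axis=0, keepdims=True)
    # Conditional velocity for mode a: eps = (x - tau*a)/s, so v = a - eps.
    v = np.stack([a - (x - tau * a) / s for a in modes])
    return (w * v).sum(axis=0)

def integrate(n_steps, n_samples=2000, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_samples)   # start from pure noise at tau = 0
    delta = 1.0 / n_steps
    for k in range(n_steps):
        x = x + delta * marginal_velocity(x, k * delta)  # forward Euler
    return x

one_step = integrate(1)    # collapses to 0, the average of the two modes
ten_steps = integrate(10)  # splits toward the modes at -1 and +1
```

With a single step, every sample lands on 0 — the average of the two modes, an action that matches neither (the mode-collapse failure). With 10 steps, the samples split toward −1 and +1, because each small step re-evaluates the field and gets pulled toward one mode.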