π0: A Vision-Language-Action Flow Model for General Robot Control
Model Architecture
The images and proprioceptive state are encoded via corresponding encoders and then projected via a linear projection layer into the same embedding space as the language tokens.
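The projection step described here can be sketched in NumPy. All dimensions, weight shapes, and names below (`d_img`, `W_img`, etc.) are illustrative placeholders, not the paper's actual values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): encoder output widths and
# the language-model embedding width the tokens are projected into.
d_img, d_state, d_model = 1152, 32, 2048

# Stand-ins for the image encoder output (per-patch features) and the
# proprioceptive state vector.
img_feats = rng.standard_normal((256, d_img))  # 256 image patches
state = rng.standard_normal(d_state)

# Linear projection layers mapping each modality into the shared
# embedding space used by the language tokens.
W_img = rng.standard_normal((d_img, d_model)) / np.sqrt(d_img)
W_state = rng.standard_normal((d_state, d_model)) / np.sqrt(d_state)

img_tokens = img_feats @ W_img   # (256, d_model)
state_token = state @ W_state    # (d_model,)

# All modalities now live in the same space and can be concatenated
# with the language token embeddings.
assert img_tokens.shape[1] == state_token.shape[0] == d_model
```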
"averaging over 10 trials per task"
- This is how many trials they run to get each success rate
Why flow-matching?
- To ensure/constrain smooth robot outputs, as opposed to random jumps in values
VLM Logic
LLMs predict sequences of tokens. What does a VLM predict?
- Still text tokens: the images are encoded and projected into the same embedding space as the text, and the model attends over them as extra context.
Flow Matching Math
At training time, the Flow Matching loss to train the policy is given by

$L^\tau(\theta) = \mathbb{E}_{p(\mathbf{A}_t \mid \mathbf{o}_t),\, q(\mathbf{A}_t^\tau \mid \mathbf{A}_t)} \big\| \mathbf{v}_\theta(\mathbf{A}_t^\tau, \mathbf{o}_t) - \mathbf{u}(\mathbf{A}_t^\tau \mid \mathbf{A}_t) \big\|^2$

Where
- $\mathbf{A}_t^\tau = \tau \mathbf{A}_t + (1 - \tau)\epsilon$, with noise $\epsilon \sim \mathcal{N}(0, I)$
- the target vector field is $\mathbf{u}(\mathbf{A}_t^\tau \mid \mathbf{A}_t) = \epsilon - \mathbf{A}_t$
- (Note: this should be $\mathbf{A}_t - \epsilon$? That is the actual derivative of the interpolant, and the sign that makes the inference-time integration go from noise toward the action.)
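A minimal NumPy sketch of one training step's quantities. The chunk length, action dimension, and the placeholder network `v_theta` are my own illustrative choices; I use the $\mathbf{A}_t - \epsilon$ sign (noise to action), per the sign question noted above:

```python
import numpy as np

rng = np.random.default_rng(0)
H, d = 50, 7  # action-chunk length and action dimension (illustrative values)

A = rng.standard_normal((H, d))    # ground-truth action chunk A_t
eps = rng.standard_normal((H, d))  # Gaussian noise sample
tau = rng.uniform()                # flow-matching time in [0, 1)

# Noisy interpolant: A_t^tau = tau * A_t + (1 - tau) * eps
A_tau = tau * A + (1 - tau) * eps

# Regression target: the constant velocity of the interpolant,
# using the noise -> action sign convention (A_t - eps).
u = A - eps

def v_theta(a_tau):
    """Placeholder for the learned policy network v_theta(A_t^tau, o_t)."""
    return np.zeros_like(a_tau)

# Flow-matching loss: MSE between predicted and target vector fields
loss = np.mean((v_theta(A_tau) - u) ** 2)
```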
Potential source of confusion
Notice that $\mathbf{v}_\theta$ is always learning to predict the full noise-to-action displacement ($\epsilon - \mathbf{A}_t$ in the paper's notation), even though it is conditioned on the intermediate point $\mathbf{A}_t^\tau$. You might think it should really be learning the remaining displacement $(1 - \tau)(\epsilon - \mathbf{A}_t)$, but you'd be mistaken.
- This multiplication by $\delta$ will be done at inference time to control "step size"
What's the point of $\tau$?
Without $\tau$, where you just start from $\epsilon$ and directly predict $\mathbf{A}_t$ in one shot, that's essentially a denoising autoencoder view.
$\tau$ allows you to better capture multi-modal behavior; without it you end up with mode collapse. Think about this scenario: the demonstrations sometimes go left and sometimes right around an obstacle, and a one-shot regressor averages the two modes into an action that does neither.
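The mode-collapse failure of one-shot MSE regression is easy to demonstrate numerically (the left/right obstacle setup here is a hypothetical 1-D toy, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two equally valid expert actions for the SAME observation, e.g. steer
# left (-1.0) or right (+1.0) around an obstacle.
actions = rng.choice([-1.0, 1.0], size=10_000)

# A one-shot regressor trained with MSE on this data converges to the
# conditional mean: close to 0.0, an action that matches neither mode.
mse_optimal = actions.mean()
assert abs(mse_optimal) < 0.05  # the "average" action, not a valid one

# Flow matching instead conditions on intermediate noisy points A^tau:
# noise drawn near one mode flows to that mode, so both behaviors survive.
```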
Flow matching trains on all intermediate points. Taking the derivative of $\mathbf{A}_t^\tau = \tau \mathbf{A}_t + (1 - \tau)\epsilon$ with respect to $\tau$, we see that:

$\frac{d \mathbf{A}_t^\tau}{d\tau} = \mathbf{A}_t - \epsilon$

- The derivative of a straight line is a constant slope
We learn to predict this gradient (a constant) so that at inference time the mapping holds for any arbitrary $\tau$.
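A quick numerical check that the velocity of the straight-line interpolant is the same constant at every $\tau$ (toy vectors, finite-difference derivative):

```python
import numpy as np

rng = np.random.default_rng(0)
A, eps = rng.standard_normal(3), rng.standard_normal(3)

interp = lambda tau: tau * A + (1 - tau) * eps  # the straight-line path A^tau

h = 1e-6
for tau in (0.1, 0.5, 0.9):
    # finite-difference derivative d(A^tau)/d(tau) at this tau
    vel = (interp(tau + h) - interp(tau)) / h
    # same constant A - eps no matter which tau we evaluate at
    assert np.allclose(vel, A - eps, atol=1e-4)
```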
At inference time, they start with random noise $\mathbf{A}_t^0 = \epsilon \sim \mathcal{N}(0, I)$ and integrate the learned vector field from $\tau = 0$ to $\tau = 1$. They use the forward Euler integration rule:

$\mathbf{A}_t^{\tau + \delta} = \mathbf{A}_t^\tau + \delta\, \mathbf{v}_\theta(\mathbf{A}_t^\tau, \mathbf{o}_t)$

- where $\delta$ is the integration step size ($\delta = 0.1$ in the paper, i.e. 10 steps)
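A sketch of the inference loop with an oracle vector field standing in for the trained network (in the real model `v` would be the learned $\mathbf{v}_\theta(\mathbf{A}_t^\tau, \mathbf{o}_t)$; the 7-dim action is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal(7)    # the action we want to reach (ground truth)
eps = rng.standard_normal(7)  # starting noise A^0 ~ N(0, I)

# Oracle vector field: the true constant velocity A - eps.
v = lambda a_tau: A - eps

delta = 0.1  # step size; 10 steps covers tau: 0 -> 1
a = eps.copy()
for _ in range(10):
    a = a + delta * v(a)  # forward Euler: A^{tau+delta} = A^tau + delta * v

# With the exact field, integrating from noise recovers the action
assert np.allclose(a, A)
```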
Why is 10 steps better than 1 step?
Because at the end of the day, we are learning a constant vector field along a straight line from noise to action.
- ChatGPT answer: If you take 10 smaller steps, each step only needs to be locally correct. Integration keeps pulling you back onto the line. So error doesn't explode; it averages out
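A toy illustration of that point. A learned field generally isn't a perfect straight line, so I use a made-up field that varies with $\tau$; with the true constant field, one step would already be exact:

```python
import numpy as np

# Toy vector field: dx/dtau = cos(pi * tau), x(0) = 0.
# True solution: x(1) = sin(pi)/pi = 0.
field = lambda tau: np.cos(np.pi * tau)
true_x1 = 0.0

def euler(n_steps):
    x, tau = 0.0, 0.0
    delta = 1.0 / n_steps
    for _ in range(n_steps):
        x += delta * field(tau)  # each step only uses the local slope
        tau += delta
    return x

err_1 = abs(euler(1) - true_x1)    # one big step: slope at tau=0 for the whole path
err_10 = abs(euler(10) - true_x1)  # ten small steps: local errors partly cancel
assert err_10 < err_1
```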