Octo: An Open-Source Generalist Robot Policy
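Doesn't seem to be widely used in practice; maybe it just doesn't work that well?
Unlike OpenVLA and RT-1, which output discretized action tokens, Octo has a dedicated action head: it takes the output (readout) embedding from the transformer backbone and runs diffusion denoising on it to produce actions. (Note that Octo's backbone is a transformer trained on robot data, not a pretrained VLM.)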
How is this much different from Diffusion Policy?
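It isn’t, really. Diffusion Policy has the same concept: take the observation embeddings and run denoising conditioned on them. The main difference is that Diffusion Policy trains its denoiser (a CNN U-Net or a transformer) from scratch for each task, whereas Octo’s backbone is pretrained on a large multi-robot dataset (Open X-Embodiment).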
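To make "takes the embedding and runs denoising" concrete, here is a minimal sketch of a DDPM-style sampling loop conditioned on an embedding. This is not Octo's or Diffusion Policy's actual code: `toy_noise_predictor` is a hypothetical stand-in for the learned head, and it cheats by pretending the embedding already equals the target action, so the loop can be checked without any training. The schedule values and function names are assumptions for illustration only.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule plus the cumulative products used by DDPM."""
    betas = [
        beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        for t in range(num_steps)
    ]
    alphas = [1.0 - b for b in betas]
    alpha_bars = []
    prod = 1.0
    for a in alphas:
        prod *= a
        alpha_bars.append(prod)
    return betas, alphas, alpha_bars


def toy_noise_predictor(noisy_action, embedding, t, alpha_bars):
    """Hypothetical oracle standing in for the learned action head.

    A real head is a network eps_theta(noisy_action, t, embedding).
    Here we assume the embedding *is* the clean action and solve
    x_t = sqrt(ab) * x_0 + sqrt(1 - ab) * eps for eps.
    """
    ab = alpha_bars[t]
    return [
        (x - math.sqrt(ab) * e) / math.sqrt(1.0 - ab)
        for x, e in zip(noisy_action, embedding)
    ]


def sample_action(embedding, num_steps=20, seed=0):
    """Start from Gaussian noise and denoise step by step."""
    rng = random.Random(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    action = [rng.gauss(0.0, 1.0) for _ in embedding]
    for t in reversed(range(num_steps)):
        eps = toy_noise_predictor(action, embedding, t, alpha_bars)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(x - coef * e) / math.sqrt(alphas[t]) for x, e in zip(action, eps)]
        if t > 0:
            # Re-inject noise on every step except the last one.
            sigma = math.sqrt(betas[t])
            action = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            action = mean
    return action


embedding = [0.3, -0.7, 0.1]  # pretend readout embedding, same size as the action
action = sample_action(embedding)
```

With the oracle predictor the sampled action lands back on the embedding, which is only a sanity check of the loop. In both Octo and Diffusion Policy the interesting part is the learned noise predictor and what it is conditioned on; the sampling loop itself looks about the same.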
Links:
Was really annoying for me to set up because I’m on a Mac.
Limitations, as pointed out by Kevin Black at the CoRL 2024 cross-embodiment workshop:
- Uses a pretrained language encoder, but not a pretrained vision encoder (pi0 fixes that)
- The action head is really small (~5% of the weights). This was a design choice so that it’s easy to swap out action heads, but it makes the head poorly suited to capturing multi-modal action distributions
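As a reminder of what "multi-modal action distribution" means here, a tiny toy example (made-up numbers, not from Octo): suppose half of the demonstrations steer left (-1) around an obstacle and half steer right (+1). A head that can only output a single averaged action ends up between the two modes, which matches neither demonstration.

```python
import statistics

# Hypothetical 1-D steering demos: half go left (-1), half go right (+1).
demos = [-1.0] * 50 + [1.0] * 50
modes = (-1.0, 1.0)

# The single action that minimizes mean squared error is the mean.
mse_optimal_action = statistics.fmean(demos)

# Distance from that averaged action to the closest real behavior.
gap_to_nearest_mode = min(abs(mse_optimal_action - m) for m in modes)

print(mse_optimal_action, gap_to_nearest_mode)
```

A sufficiently expressive diffusion head can instead sample either -1 or +1, which is why the capacity of the head matters for this kind of data.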