Octo: An Open-Source Generalist Robot Policy

Doesn't seem to be widely used in practice. Maybe it doesn't work that well?

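Unlike OpenVLA and RT-1, which output discretized action tokens, Octo actually has a separate action head. It takes the output embedding from the transformer backbone (Octo's backbone is a transformer trained on robot data, not a pretrained VLM) and runs diffusion denoising conditioned on it to produce continuous actions.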
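For contrast, here is a minimal sketch of the discretized-action approach that RT-1 and OpenVLA take: each action dimension is binned into 256 tokens, and a token is decoded back to its bin center. The function names and the fixed [-1, 1] bounds are assumptions for illustration; the real models derive bounds from data.

```python
def discretize(action, low=-1.0, high=1.0, num_bins=256):
    """Map each continuous action dimension to an integer bin index (token)."""
    tokens = []
    for a in action:
        a = min(max(a, low), high)  # clip to the assumed action bounds
        idx = int((a - low) / (high - low) * num_bins)
        tokens.append(min(idx, num_bins - 1))  # a == high lands in the last bin
    return tokens


def undiscretize(tokens, low=-1.0, high=1.0, num_bins=256):
    """Map bin indices back to the center of each bin."""
    width = (high - low) / num_bins
    return [low + (t + 0.5) * width for t in tokens]


action = [-1.0, 0.0, 0.3, 1.0]
tokens = discretize(action)
recovered = undiscretize(tokens)
```

The round trip loses at most half a bin width per dimension, which is the precision a continuous denoising head avoids giving up.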

How is this much different from Diffusion Policy?

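It isn't, really. Diffusion Policy has the same concept: take the observation embeddings and run denoising conditioned on them. The difference is that Diffusion Policy trains its denoiser (a CNN or a transformer) from scratch for each task, whereas Octo is pretrained on a large multi-robot dataset.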
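The shared idea, starting from noise and iteratively denoising an action conditioned on an embedding, can be sketched as a toy DDPM-style sampling loop. This is not Octo's or Diffusion Policy's code: `toy_noise_predictor` is a hypothetical stand-in for the learned network, and it simply pretends the clean action equals the conditioning embedding so that the loop is checkable.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule plus the cumulative products used by DDPM."""
    betas = [beta_start + (beta_end - beta_start) * t / (num_steps - 1)
             for t in range(num_steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, prod = [], 1.0
    for a in alphas:
        prod *= a
        alpha_bars.append(prod)
    return betas, alphas, alpha_bars


def toy_noise_predictor(noisy_action, embedding, alpha_bar):
    """Hypothetical stand-in for the learned denoiser.

    Assumes the clean action x0 equals the embedding, then solves
    x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps for eps.
    """
    return [(x - math.sqrt(alpha_bar) * e) / math.sqrt(1.0 - alpha_bar)
            for x, e in zip(noisy_action, embedding)]


def sample_action(embedding, num_steps=20, seed=0):
    """Reverse diffusion: start from Gaussian noise, denoise step by step."""
    rng = random.Random(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    action = [rng.gauss(0.0, 1.0) for _ in embedding]
    for t in reversed(range(num_steps)):
        eps = toy_noise_predictor(action, embedding, alpha_bars[t])
        # DDPM posterior mean for step t
        mean = [(x - betas[t] / math.sqrt(1.0 - alpha_bars[t]) * e)
                / math.sqrt(alphas[t]) for x, e in zip(action, eps)]
        if t > 0:
            sigma = math.sqrt(betas[t])
            action = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            action = mean  # no noise is added on the final step
    return action


embedding = [0.2, -0.5, 0.9]
action = sample_action(embedding)
```

In a real policy the predictor is a trained network that takes the noisy action, the timestep, and the embedding as inputs. Whether that network is trained per task or pretrained across robots is the main difference between the two methods.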

Links:

Project page: https://octo-models.github.io/
Repo: https://github.com/octo-models/octo/tree/main

Setup was really annoying for me because I'm on a Mac.

Limitations, as pointed out by Kevin Black

At the CoRL 2024 cross-embodiment workshop.

Uses a pretrained language encoder, but not a pretrained vision encoder (pi0 fixes that)

The action head is really small (~5% of weights). This was a deliberate design choice so that action heads are easy to swap out, but a head that small is not great at capturing multi-modal action distributions.

pi0 is the attempt to scale up Octo.

🛠️ Steven Gong
