GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
Project page:
Note
Interesting robot foundation model: in the pre-training stage it learns to predict future video frames, and during fine-tuning it learns to predict both video and actions.
“GR-2 is a language-conditioned GPT-style visual manipulation policy model (Fig. 1).”
The training undergoes two stages:
- Video generative pre-training: GR-2 is trained on a curated large-scale video dataset to predict future frames.
- Robot data fine-tuning: GR-2 is fine-tuned on robot data to predict action trajectories and videos in tandem (see the sketch after this list).
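To make the two-stage objective concrete, here is a minimal PyTorch sketch, not the authors' code: `GR2LikePolicy`, `training_step`, and all shapes and hyperparameters are my assumptions. A GPT-style causal backbone predicts next-frame VQ tokens; during fine-tuning an action head additionally regresses a short trajectory, so both losses are optimized in tandem.

```python
# Hypothetical sketch of the two-stage training objective (not the paper's code).
# Stage 1 (pre-training): video-token prediction loss only.
# Stage 2 (fine-tuning):  video-token loss + action-trajectory loss in tandem.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GR2LikePolicy(nn.Module):  # name and sizes are assumptions
    def __init__(self, vocab_size=8192, d_model=512, n_layers=4, n_heads=8,
                 action_dim=7, horizon=10, lang_dim=768):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.lang_proj = nn.Linear(lang_dim, d_model)      # pooled text-encoder features
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.video_head = nn.Linear(d_model, vocab_size)   # next-frame VQ-token logits
        self.action_head = nn.Linear(d_model, action_dim * horizon)  # trajectory regression
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, video_tokens, lang_feat):
        # video_tokens: (B, T) discrete VQ indices of observed frames
        # lang_feat:    (B, lang_dim) language-instruction embedding
        x = torch.cat([self.lang_proj(lang_feat).unsqueeze(1),
                       self.token_emb(video_tokens)], dim=1)          # (B, T+1, d)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        h = self.backbone(x, mask=mask)                               # causal attention
        video_logits = self.video_head(h[:, 1:])                      # (B, T, vocab)
        actions = self.action_head(h[:, -1]).view(-1, self.horizon, self.action_dim)
        return video_logits, actions

def training_step(model, batch, finetune: bool):
    video_logits, pred_actions = model(batch["video_tokens"], batch["lang_feat"])
    # shift-by-one next-token objective over the VQ indices (video generation)
    video_loss = F.cross_entropy(
        video_logits[:, :-1].reshape(-1, video_logits.size(-1)),
        batch["video_tokens"][:, 1:].reshape(-1))
    if not finetune:
        return video_loss                       # stage 1: video generative pre-training
    action_loss = F.smooth_l1_loss(pred_actions, batch["actions"])
    return video_loss + action_loss             # stage 2: video + action in tandem

# Toy usage: a fake batch runs both stages end to end.
model = GR2LikePolicy()
batch = {"video_tokens": torch.randint(0, 8192, (2, 6)),
         "lang_feat": torch.randn(2, 768),
         "actions": torch.randn(2, 10, 7)}
loss = training_step(model, batch, finetune=True)
```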
They use a VQGAN to tokenize frames into discrete tokens. Why don't they use a diffusion model instead?
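My guess on the design choice: quantizing each frame into codebook indices lets video generation share the same next-token objective as the GPT-style backbone, whereas a diffusion decoder would denoise continuous latents under a separate objective. A self-contained sketch of the quantization step; the function name, shapes, and codebook size are illustrative, not from the paper.

```python
# Frames become a short sequence of discrete codebook indices, so "generating video"
# reduces to next-token prediction. Shapes/names here are illustrative assumptions.
import torch

def vq_tokenize(feat_map: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """feat_map: (B, C, H, W) encoder features; codebook: (K, C) learned codes.
    Returns (B, H*W) integer token ids via nearest-neighbour lookup."""
    B, C, H, W = feat_map.shape
    flat = feat_map.permute(0, 2, 3, 1).reshape(-1, C)    # one C-dim vector per patch
    dists = torch.cdist(flat, codebook)                   # (B*H*W, K) distances to codes
    ids = dists.argmin(dim=-1)                            # nearest code per patch
    return ids.view(B, H * W)

# Example: a 16x16 latent grid with an 8192-entry codebook -> 256 tokens per frame,
# which a causal transformer can predict frame by frame like text.
tokens = vq_tokenize(torch.randn(2, 256, 16, 16), torch.randn(8192, 256))
print(tokens.shape)  # torch.Size([2, 256])
```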
This video-prediction objective is their way of teaching the policy a world model.