GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

Project page:

Note

Interesting robot foundation model that learns to predict future video frames during pre-training, and learns to predict both video and actions during fine-tuning.

“GR-2 is a language-conditioned GPT-style visual manipulation policy model (Fig. 1).”

The training undergoes two stages (see the sketch after this list):

  1. Video generative pre-training: GR-2 is pre-trained on a curated large-scale video dataset to predict future frames.
  2. Robot data fine-tuning: GR-2 is fine-tuned on robot data to predict action trajectories and videos in tandem.
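
A minimal PyTorch sketch of how I read the two-stage setup: the same GPT-style trunk gets a video-token prediction loss in stage 1 and a joint video + action loss in stage 2. All names, shapes, and the loss weighting here are my own illustration, not the GR-2 implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GenerativeVLA(nn.Module):
    def __init__(self, vocab_size=8192, d_model=768, n_heads=12, n_layers=4, action_dim=7):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)  # causal GPT-style trunk
        self.video_head = nn.Linear(d_model, vocab_size)        # next video-token logits
        self.action_head = nn.Linear(d_model, action_dim)       # action trajectory (stage 2 only)

    def forward(self, tokens):                                  # tokens: [B, T] ids (language + video)
        x = self.token_emb(tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        h = self.backbone(x, mask=mask)
        return self.video_head(h), self.action_head(h)

def pretrain_step(model, input_tokens, target_video_tokens):
    # Stage 1: video generation only -- web videos carry no action labels.
    video_logits, _ = model(input_tokens)
    return F.cross_entropy(video_logits.flatten(0, 1), target_video_tokens.flatten())

def finetune_step(model, input_tokens, target_video_tokens, target_actions, w=1.0):
    # Stage 2: predict future video tokens and action trajectories in tandem.
    video_logits, pred_actions = model(input_tokens)
    video_loss = F.cross_entropy(video_logits.flatten(0, 1), target_video_tokens.flatten())
    action_loss = F.mse_loss(pred_actions, target_actions)
    return video_loss + w * action_loss
```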

They use a VQGAN to turn frames into discrete tokens for video prediction. Why don’t they use a diffusion model instead?
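
To keep the VQGAN question concrete, here is a generic vector-quantization step, the core of a VQGAN tokenizer: continuous frame features get snapped to their nearest codebook entries, and the resulting discrete indices are what a GPT-style model can predict autoregressively. Codebook size and dimensions are made up; this is not the paper’s tokenizer.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=1024, code_dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, z):                                   # z: [B, N, code_dim] encoder features
        # Squared L2 distance from every feature vector to every codebook entry.
        d = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)  # [B, N, num_codes]
        idx = d.argmin(dim=-1)                              # discrete token ids the GPT models
        z_q = self.codebook(idx)                            # quantized features for the decoder
        # Straight-through estimator: copy gradients back to the encoder features.
        z_q = z + (z_q - z).detach()
        return z_q, idx

quantizer = VectorQuantizer()
feats = torch.randn(2, 8 * 8, 64)                           # e.g. an 8x8 feature map per frame
z_q, tokens = quantizer(feats)                              # `tokens` feed the GPT-style trunk
```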

Video prediction is their way of teaching the policy a world model.