RT-1: Robotics Transformer for Real-World Control at Scale

Question: how is the model trained on the datasets that have no instruction annotations?

  • In that case, the instruction is simply left empty.
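
A minimal sketch of what "left empty" could look like in a data-loading step. This is not the RT-1 code: the key name `natural_language_instruction` and the helper `get_instruction` are hypothetical, chosen only to illustrate falling back to an empty string when a dataset carries no annotation.

```python
from typing import Any, Mapping

# Hypothetical field name; a real dataset may store the instruction under a different key.
INSTRUCTION_KEY = "natural_language_instruction"


def get_instruction(step: Mapping[str, Any]) -> str:
    """Return the step's instruction text, or "" when no annotation exists."""
    value = step.get(INSTRUCTION_KEY)
    if value is None:
        # Unannotated dataset: the model still gets a language input, just an empty one.
        return ""
    if isinstance(value, bytes):
        # Some storage formats keep strings as raw bytes.
        value = value.decode("utf-8")
    return value.strip()


annotated_step = {INSTRUCTION_KEY: "pick up the apple"}
unannotated_step = {}  # no instruction field at all

print(repr(get_instruction(annotated_step)))
print(repr(get_instruction(unannotated_step)))
```

With a fallback like this, annotated and unannotated datasets can share one input format, so the same model can consume both.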

“It takes in a history of 15 images along with the natural language.”