RT-H: Action Hierarchies Using Language

Really cool paper by Ted Xiao at Deepmind

Resources

“Belkhale and Sadigh”

  • Mentioned from OpenVLA-OFT
  • “Recent works by Belkhale and Sadigh [2] and Pertsch et al. [36] improve VLA efficiency through new action tokenization schemes,”

  • This is the architecture

Doing this for all 9 action dimensions (3 dimensions for delta position, 3 dimensions for delta orientation, 2 dimensions for base movement, 1 dimension for gripper)

  • Tasks: flip bowl upright
  • Grab schooper