OpenVLA

Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success (OpenVLA-OFT)

Paper that introduces an optimized fine-tuning (OFT) recipe for OpenVLA.

Links

Architecture

Two main contributions:

  1. Add parallel decoding of action chunks
  2. FiLM (feature-wise linear modulation) for better adherence to instructions
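A minimal sketch of the FiLM idea: the language embedding is projected into per-channel scale and shift terms that modulate the visual features. The dimensions and module layout here are my assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation (illustrative sketch):
    scale and shift visual features using projections of the
    language embedding. Dimensions are assumed, not from the paper."""

    def __init__(self, lang_dim: int, feat_dim: int):
        super().__init__()
        self.to_gamma = nn.Linear(lang_dim, feat_dim)
        self.to_beta = nn.Linear(lang_dim, feat_dim)

    def forward(self, visual_feats: torch.Tensor, lang_emb: torch.Tensor) -> torch.Tensor:
        # visual_feats: (B, N, feat_dim); lang_emb: (B, lang_dim)
        gamma = self.to_gamma(lang_emb).unsqueeze(1)  # (B, 1, feat_dim)
        beta = self.to_beta(lang_emb).unsqueeze(1)    # (B, 1, feat_dim)
        # (1 + gamma) keeps the transform near identity at initialization
        return visual_feats * (1 + gamma) + beta

film = FiLM(lang_dim=512, feat_dim=256)
out = film(torch.randn(2, 16, 256), torch.randn(2, 512))
print(out.shape)
```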

Parallel decoding seems like an interesting way to increase inference speed: the whole action chunk is predicted in a single forward pass instead of one token at a time.
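A toy contrast between the two decoding modes (my sketch, not the paper's code): autoregressive decoding needs one forward pass per action token, while parallel decoding appends placeholder tokens and predicts the full chunk in one pass. The chunk size, dimensions, and use of a generic transformer layer are assumptions for illustration.

```python
import torch
import torch.nn as nn

CHUNK, DIM, ACT = 8, 64, 7  # assumed sizes, not the paper's values

head = nn.Linear(DIM, ACT)
layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)

def autoregressive(ctx: torch.Tensor) -> torch.Tensor:
    # CHUNK sequential passes, each conditioned on what came before
    seq, actions = ctx, []
    for _ in range(CHUNK):
        h = layer(seq)
        actions.append(head(h[:, -1]))
        seq = torch.cat([seq, h[:, -1:]], dim=1)
    return torch.stack(actions, dim=1)  # (B, CHUNK, ACT)

def parallel(ctx: torch.Tensor) -> torch.Tensor:
    # one pass: append CHUNK placeholder tokens, decode them all at once
    placeholders = torch.zeros(ctx.size(0), CHUNK, DIM)
    h = layer(torch.cat([ctx, placeholders], dim=1))
    return head(h[:, -CHUNK:])  # (B, CHUNK, ACT)

ctx = torch.randn(1, 10, DIM)
ar_out, par_out = autoregressive(ctx), parallel(ctx)
print(ar_out.shape, par_out.shape)
```

Both produce the same output shape, but the parallel path runs the decoder once rather than CHUNK times, which is where the speedup comes from.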

Has some really good visualizations of different tasks and their failure modes.