🛠️ Steven Gong

Aug 24, 2025, 1 min read

Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success (OpenVLA-OFT)

Paper that introduces an optimized fine-tuning (OFT) recipe for OpenVLA, aimed at both faster inference and higher task success.

Links

https://openvla-oft.github.io/
https://github.com/moojink/openvla-oft

Architecture

Llama 2 language backbone, inherited from OpenVLA (why 2??)
FiLM

Two main contributions:

Add parallel decoding
FiLM for better adherence to language instructions
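To make the FiLM idea concrete, here is a minimal sketch of feature-wise linear modulation: a language embedding is projected to a per-feature scale (gamma) and shift (beta), which are then applied to every visual patch feature. The function names, dimensions, and random weights are all hypothetical, and the sketch does not show where OpenVLA-OFT actually applies the modulation inside its vision backbone.

```python
import random


def linear(weights, bias, vec):
    """Dense layer on plain Python lists: returns weights @ vec + bias."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def init_params(lang_dim, feat_dim, seed=0):
    """Hypothetical FiLM parameters; untrained random weights for illustration."""
    rng = random.Random(seed)

    def rand_matrix():
        return [[rng.uniform(-0.1, 0.1) for _ in range(lang_dim)]
                for _ in range(feat_dim)]

    return {
        "w_gamma": rand_matrix(), "b_gamma": [1.0] * feat_dim,  # scale starts near 1
        "w_beta": rand_matrix(), "b_beta": [0.0] * feat_dim,    # shift starts near 0
    }


def film(visual_features, lang_embedding, params):
    """Apply gamma(lang) * x + beta(lang) to each visual patch feature vector."""
    gamma = linear(params["w_gamma"], params["b_gamma"], lang_embedding)
    beta = linear(params["w_beta"], params["b_beta"], lang_embedding)
    return [[g * x + b for g, x, b in zip(gamma, patch, beta)]
            for patch in visual_features]


# Two patches with three features each, and two different toy "instructions".
patches = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
params = init_params(lang_dim=2, feat_dim=3)
features_a = film(patches, [1.0, 0.0], params)
features_b = film(patches, [0.0, 1.0], params)
```

The same image patches produce different features under different instructions, which is the mechanism that lets the language input steer what the visual pathway attends to.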

Parallel decoding seems like an interesting way to increase inference speed.
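A toy sketch of why this helps: autoregressive decoding needs one forward pass per predicted value, while parallel decoding feeds placeholder inputs for every position and reads all predictions from a single pass. `ToyPolicy` is a hypothetical stand-in that only counts forward passes, and the chunk length and action dimension are illustrative values, not the paper's code.

```python
class ToyPolicy:
    """Hypothetical stand-in for a VLA backbone; it counts forward passes."""

    def __init__(self):
        self.forward_passes = 0

    def forward(self, observation, action_inputs):
        # One forward pass returns one prediction per input position.
        # This toy ignores the input values; a real autoregressive model would not.
        self.forward_passes += 1
        return [round(observation + 0.1 * i, 1) for i in range(len(action_inputs))]


def decode_autoregressive(policy, observation, num_values):
    """Decode one value per forward pass, feeding back everything decoded so far."""
    actions = []
    for _ in range(num_values):
        preds = policy.forward(observation, actions + [None])
        actions.append(preds[-1])  # keep only the newest position
    return actions


def decode_parallel(policy, observation, num_values):
    """Feed empty placeholders for all positions and read every output at once."""
    placeholders = [None] * num_values
    return policy.forward(observation, placeholders)


chunk_len, action_dim = 8, 7  # illustrative sizes
num_values = chunk_len * action_dim

ar_policy, par_policy = ToyPolicy(), ToyPolicy()
ar_actions = decode_autoregressive(ar_policy, 0.5, num_values)
par_actions = decode_parallel(par_policy, 0.5, num_values)

print(ar_policy.forward_passes, par_policy.forward_passes)
```

The trade-off the toy hides is that parallel outputs can no longer condition on previously decoded values, so the speedup comes from giving up that dependency.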

It also has some really good visualizations of different tasks and how they fail.
