Is Value Learning Really the Main Bottleneck in Offline RL?

  • https://seohong.me/projects/offrl-bottlenecks/

  • For future algorithms research, we emphasize two important open questions in offline RL:

    • (1) What is the best way to extract a policy from the learned value function? Is there a better way than DDPG+BC?
    • (2) How can we train a policy so that it generalizes well to test-time states?
  • **Takeaway:** Do not use weighted behavior cloning (AWR); always use behavior-constrained policy gradient (DDPG+BC). This enables better scaling of performance with more data and better use of the value function (see the sketch below).
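
    To make the contrast concrete, here is a minimal sketch of the two policy extraction losses in PyTorch. The network shapes, the hyperparameters `beta`/`alpha`, and the names `q_fn`/`v_fn` (critics returning per-sample scalars) are illustrative assumptions of mine, not the paper's exact setup. The structural point: AWR only reweights a regression onto dataset actions, while DDPG+BC backpropagates first-order value information through the critic.

    ```python
    import torch
    import torch.nn as nn

    class Policy(nn.Module):
        """Deterministic tanh policy (illustrative architecture)."""
        def __init__(self, obs_dim, act_dim, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(obs_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, act_dim), nn.Tanh(),
            )

        def forward(self, obs):
            return self.net(obs)

    def awr_loss(policy, q_fn, v_fn, obs, actions, beta=1.0):
        # Advantage-weighted regression (weighted behavior cloning):
        # regress onto *dataset* actions, weighted by exp(advantage).
        # No gradient flows through the value function, so the policy
        # can only imitate actions already present in the data.
        with torch.no_grad():
            adv = q_fn(obs, actions) - v_fn(obs)
            w = torch.exp(adv / beta).clamp(max=100.0)  # clipped weights
        mse = ((policy(obs) - actions) ** 2).sum(dim=-1)
        return (w * mse).mean()

    def ddpg_bc_loss(policy, q_fn, obs, actions, alpha=1.0):
        # Behavior-constrained policy gradient (DDPG+BC): maximize Q at
        # the policy's *own* action, backpropagating through the critic,
        # with a BC penalty keeping the policy close to the data.
        pi_act = policy(obs)
        q = q_fn(obs, pi_act)
        bc = ((pi_act - actions) ** 2).sum(dim=-1)
        return (-q + alpha * bc).mean()
    ```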

  • **Takeaway:** Test-time policy generalization is one of the most significant bottlenecks in offline RL. Use high-coverage datasets. Improve policy accuracy on test-time states with on-the-fly policy improvement techniques (see the sketch below).
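
    One example of such an on-the-fly improvement technique, in the spirit of the paper's OPEX idea: at evaluation time, nudge the policy's action along the critic's action gradient. Continuing the sketch above (same `policy`/`q_fn` assumptions; `step_size` and the clamping range are hypothetical hyperparameters):

    ```python
    import torch

    def test_time_action(policy, q_fn, obs, step_size=0.1):
        # One gradient-ascent step on the frozen critic w.r.t. the
        # action. Nothing is trained here; this runs only at
        # evaluation, refining the action the policy would take.
        a = policy(obs).detach().requires_grad_(True)
        q = q_fn(obs, a).sum()  # sum over batch so grad keeps batch shape
        (grad,) = torch.autograd.grad(q, a)
        return (a + step_size * grad).clamp(-1.0, 1.0).detach()
    ```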

  • I find the figure in the post quite helpful.

This work is cited by Flow Q-Learning in the context of value learning.

  1. First, we find that the choice of policy extraction algorithm often has a larger impact on performance than the choice of value learning algorithm.
  2. Second, we find that the performance of offline RL is often heavily bottlenecked by how well the policy generalizes to test-time states, rather than by its performance on training states.

They find that DDPG+BC almost always performs better than AWR-style weighted behavior cloning.