What Matters for Batch Online Reinforcement Learning

The RL signal comes from a learned Q-function.

  • They train the Q-function with IQL (Implicit Q-Learning)
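
For reference, the core of IQL is an expectile regression loss used to fit a value function V(s) to Q(s, a) without querying out-of-dataset actions. The snippet below is a minimal NumPy sketch of that loss only. The function name `expectile_loss` and the default `tau=0.7` are illustrative assumptions, not values taken from the paper these notes cover.

```python
import numpy as np

def expectile_loss(diff, tau=0.7):
    """Asymmetric squared loss on diff = Q(s, a) - V(s).

    Positive errors are weighted by tau and negative errors by (1 - tau),
    so tau > 0.5 pushes V toward the upper range of Q values.
    The default tau is an illustrative assumption.
    """
    diff = np.asarray(diff, dtype=float)
    weight = np.where(diff < 0.0, 1.0 - tau, tau)
    return float(np.mean(weight * diff ** 2))

# With tau = 0.5 this reduces to half the ordinary mean squared error.
loss = expectile_loss([1.0, -1.0], tau=0.7)
```

With `tau` above 0.5, underestimating Q is penalized more than overestimating it, which is what lets V approximate a maximum over in-distribution actions.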

One key finding is that implicit policy extraction works much better than explicit policy extraction.

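  • Explicit policy extraction trains the policy directly against the Q-function, so the policy is optimized to output actions the Q-function rates highly
  • Implicit policy extraction samples multiple candidate trajectories from the policy, scores each one with the Q-function, and selects the best-scoring one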
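
The sample, score, select loop of implicit policy extraction can be sketched as below. This is a toy illustration under assumptions, not the paper's implementation: `implicit_extract`, `toy_sampler`, and `toy_q` are hypothetical names, the sampler stands in for a learned policy, the Q-function is a hand-written stand-in, and single actions are used in place of full trajectories.

```python
import numpy as np

def implicit_extract(state, sample_actions, q_fn, num_samples=16, rng=None):
    """Sample candidates from the policy, score them with Q, return the best."""
    rng = np.random.default_rng() if rng is None else rng
    candidates = sample_actions(state, num_samples, rng)        # shape (N, action_dim)
    scores = np.array([q_fn(state, a) for a in candidates])    # one score per candidate
    best = int(np.argmax(scores))
    return candidates[best], float(scores[best])

# Hypothetical stand-ins for a learned policy and a learned Q-function.
def toy_sampler(state, n, rng):
    # Gaussian noise around the state plays the role of the policy's action distribution.
    return state + rng.normal(scale=1.0, size=(n, state.shape[0]))

def toy_q(state, action):
    # Prefers actions close to the all-ones vector.
    return -float(np.sum((action - 1.0) ** 2))

state = np.zeros(2)
action, score = implicit_extract(state, toy_sampler, toy_q, rng=np.random.default_rng(0))
```

Note that the policy itself is never differentiated through the Q-function here; the Q-function only ranks samples the policy already produces, which is the contrast with explicit extraction.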