Distributional Shift

This is a major issue in offline RL; the paper *Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems* discusses it in Section 4.2.

What is distributional shift in the offline RL case? The distribution your training data comes from is different from the distribution you face at test time (a sketch after the quote below illustrates where this shows up in the Bellman backup).

  • “while our function approximator (policy, value function, or model) might be trained under one distribution, it will be evaluated on a different distribution”.
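A minimal sketch of where this bites in offline Q-learning (my own illustration, not from the paper; the function and key names are hypothetical). The Bellman target queries the Q-function at actions sampled from the *learned* policy, but Q was only ever fit on state-action pairs from the static dataset:

```python
import torch

def td_target(q_net, policy, batch, gamma=0.99):
    """Actor-critic style backup on a fixed offline batch.

    batch: dict of tensors with keys 's', 'a', 'r', 's_next', 'done',
           all sampled from the static dataset D (the behavior policy's
           distribution).
    """
    with torch.no_grad():
        # Actions come from the current policy, NOT from the dataset.
        # If pi(s') puts mass outside the support of D, q_net is being
        # evaluated out-of-distribution here -- any overestimation error
        # gets copied into the target and amplified by bootstrapping.
        a_next = policy(batch["s_next"])
        q_next = q_net(batch["s_next"], a_next)
        return batch["r"] + gamma * (1.0 - batch["done"]) * q_next
```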

One solution is to apply conservatism, e.g. Conservative Q-Learning (CQL): penalize Q-values on out-of-distribution actions so the policy stays close to the data. However, during online fine-tuning this can be a really bad thing, because we have deliberately learned pessimistic, often badly underestimated values for out-of-distribution states and actions, and the online phase has to unlearn them. WSRL talks about this, but I'm still not convinced on how WSRL works.
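A rough sketch of the conservative idea (my paraphrase, not the CQL authors' code; the full CQL(H) objective uses a log-sum-exp over actions, whereas this uses policy samples): push Q down on actions the current policy proposes and up on actions actually in the dataset, on top of the usual TD error.

```python
import torch
import torch.nn.functional as F

def conservative_q_loss(q_net, policy, batch, td_target, alpha=1.0):
    # Q on dataset actions (in-distribution).
    q_data = q_net(batch["s"], batch["a"])

    # Q on actions sampled from the current policy (possibly out-of-distribution).
    a_pi = policy(batch["s"])
    q_pi = q_net(batch["s"], a_pi)

    # Conservative gap: large when OOD actions look better than dataset actions.
    conservative_penalty = (q_pi - q_data).mean()

    # Standard Bellman error on the offline batch.
    bellman_error = F.mse_loss(q_data, td_target)

    # alpha trades off conservatism against fitting the Bellman backup;
    # large alpha is exactly what can leave badly underestimated values
    # for OOD states once we switch to online learning.
    return bellman_error + alpha * conservative_penalty
```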