Offline Reinforcement Learning

Offline RL uses previously collected data without any additional data collection.

Offline RL is essentially a tug-of-war between behavioral regularization and value maximization.

Resources / readings:

Techniques:

Notation below is from Conservative Q-Learning (CQL). We have a fixed replay buffer $\mathcal{D}$, collected by some behavior policy $\pi_\beta$, that we are learning from.

Policy evaluation step, via the Bellman expectation backup:

$$\hat{Q}^{k+1} \leftarrow \arg\min_{Q}\ \mathbb{E}_{(s,a,s') \sim \mathcal{D}}\Big[\Big(r(s,a) + \gamma\, \mathbb{E}_{a' \sim \hat{\pi}^{k}(\cdot \mid s')}\big[\hat{Q}^{k}(s', a')\big] - Q(s,a)\Big)^{2}\Big]$$

Policy improvement step:

$$\hat{\pi}^{k+1} \leftarrow \arg\max_{\pi}\ \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi(\cdot \mid s)}\big[\hat{Q}^{k+1}(s, a)\big]$$
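As a rough sketch of what these two steps look like in code, here is a generic actor-critic version in PyTorch. The networks, the toy batch, and the hyperparameters are placeholder assumptions for illustration, not the CQL reference implementation.

```python
# Minimal sketch: one round of policy evaluation + policy improvement on a
# fixed offline batch. `q_net`, `policy`, and the random "dataset" are toys.
import torch
import torch.nn as nn

obs_dim, act_dim, gamma = 4, 2, 0.99

q_net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
q_opt = torch.optim.Adam(q_net.parameters(), lr=3e-4)
pi_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

# A toy batch (s, a, r, s') "sampled" from the offline replay buffer D.
s = torch.randn(32, obs_dim)
a = torch.randn(32, act_dim)
r = torch.randn(32, 1)
s_next = torch.randn(32, obs_dim)

# --- Policy evaluation: regress Q(s,a) onto the Bellman expectation backup ---
# (in practice a separate target network is used for the backup)
with torch.no_grad():
    a_next = policy(s_next)                                   # a' ~ pi(.|s')
    target = r + gamma * q_net(torch.cat([s_next, a_next], dim=-1))
q_loss = ((q_net(torch.cat([s, a], dim=-1)) - target) ** 2).mean()
q_opt.zero_grad(); q_loss.backward(); q_opt.step()

# --- Policy improvement: push pi towards actions that maximize Q on dataset states ---
pi_loss = -q_net(torch.cat([s, policy(s)], dim=-1)).mean()
pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
```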

Notice that we train our Q only on state-action pairs sampled from $\mathcal{D}$ (i.e., from the behavior policy $\pi_\beta$), and then improve a policy $\pi$ that tries to maximize $\hat{Q}(s, a)$. That maximization can get biased towards actions where $Q$ is queried out-of-distribution (OOD) and never grounded by data.

As a result, naive offline RL tends to overestimate the Q-values, especially on those OOD actions.
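A tiny toy illustration of where that bias comes from (the ten-action setup and noise levels below are made up for the example): if the fitted Q is only accurate on actions the dataset covers, maximizing over all actions mostly picks up the extrapolation error on the OOD ones.

```python
# Toy demo: true value of every action is 0, but Q-estimates on actions the
# dataset never covered carry large extrapolation error. The max grabs it.
import numpy as np

rng = np.random.default_rng(0)
true_q = np.zeros(10)                        # 10 actions, all truly worth 0
seen = np.arange(3)                          # only 3 actions appear in D
q_hat = true_q + rng.normal(0.0, 0.1, 10)    # small error where we have data
q_hat[3:] += rng.normal(0.0, 1.0, 7)         # large error on OOD actions

print("max over dataset actions:", q_hat[seen].max())  # close to 0
print("max over all actions    :", q_hat.max())        # typically much larger
```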

But isn't this a problem in online RL as well???

Yes, but in online Q-learning you can actually try those overestimated actions during exploration.

Say the Q-network hallucinates that $Q(s, a)$ is high for some action $a$. The policy tries $a$ (e.g., via ε-greedy). If the reward turns out to be low, the Q-value gets corrected by the TD update:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \big[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\big]$$

πŸ” So exploration and environment feedback act as a self-correcting mechanism.

So the estimate gets corrected over time. But in offline RL, you don't have the benefit of controlling your exploration: OOD mistakes are never tested against the environment, so you lose that corrective loop of generalized policy iteration (GPI).

That’s why conservative offline RL methods (like CQL, BCQ, etc.) try to stay close to the dataset, rather than blindly trusting the raw Q-values.
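For intuition, here is a hedged sketch of the CQL(H)-style conservative term for discrete actions, added on top of the ordinary Bellman error: push Q down on all actions (via a logsumexp) and push it back up on the actions actually in the dataset. The network, batch, and `cql_alpha` weight are assumptions for the example, not the paper's exact training code.

```python
# Sketch of a CQL-style conservative loss for a discrete-action Q-network.
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, n_actions, gamma, cql_alpha = 4, 6, 0.99, 1.0
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
opt = torch.optim.Adam(q_net.parameters(), lr=3e-4)

# Toy batch from the offline dataset D.
s = torch.randn(32, obs_dim)
a = torch.randint(0, n_actions, (32,))
r = torch.randn(32, 1)
s_next = torch.randn(32, obs_dim)

q_all = q_net(s)                                 # Q(s, .)
q_sa = q_all.gather(1, a.unsqueeze(1))           # Q(s, a) on dataset actions
with torch.no_grad():                            # (a target network in practice)
    target = r + gamma * q_net(s_next).max(dim=1, keepdim=True).values

bellman_loss = F.mse_loss(q_sa, target)
# Conservative term: logsumexp_a Q(s, a)  -  E_{a ~ D}[Q(s, a)]
cql_term = (torch.logsumexp(q_all, dim=1, keepdim=True) - q_sa).mean()

loss = bellman_loss + cql_alpha * cql_term
opt.zero_grad(); loss.backward(); opt.step()
```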

So why do people try offline RL??

Real-world data is expensive or dangerous to collect, and logged data already exists at enormous scale (think about the internet). So we want to figure out how to learn good policies in a purely offline setting.

Offline RL vs Off-Policy RL?

Offline RL is always off-policy, by definition. However, off-policy doesn’t necessarily mean offline: off-policy methods can still collect new data. Offline RL is extra hard because it cannot explore at all; it is stuck with a fixed offline dataset (unless you do something like Batch Online RL and add a phase of online interaction).