Offline Reinforcement Learning
Offline RL uses previously collected data without any additional data collection.
Offline RL is always off-policy, by definition. However, off-policy doesn't necessarily mean offline RL. Offline RL is extra hard because the agent cannot explore at all: it never collects new experience that could correct its mistakes.
Techniques:
The problem:
We have a replay buffer (a fixed dataset) that we are learning from. We use this replay buffer to do our Bellman backups. However, the backup target queries Q-values for actions the dataset may never contain (e.g., the max over actions in the Q-learning target), and errors on those out-of-distribution actions, typically overestimation, never get corrected.
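Here is a minimal sketch of where that creeps in, assuming a tabular setup; `buffer`, `Q`, `gamma`, and the numbers are made up for illustration, not from any specific library:

```python
import numpy as np

n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))
gamma, lr = 0.99, 0.1

# `buffer` is a fixed, previously collected dataset of (s, a, r, s') tuples.
buffer = [(0, 1, 1.0, 2), (2, 0, 0.0, 5), (5, 3, -1.0, 0)]

for s, a, r, s_next in buffer:
    # The target maximizes over *all* actions at s', including actions the
    # behavior policy never took there. If Q overestimates one of those
    # out-of-distribution actions, nothing in the fixed dataset can correct it.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += lr * (target - Q[s, a])
```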
https://chatgpt.com/share/687d9452-1790-8002-adc8-8efc3d707abd
But isn't this a problem in online RL as well???
Yes, but in online Q-learning, you can actually try those overestimated actions during exploration.
Say the Q-network hallucinates that Q(s, a) is high. The policy tries a (via ε-greedy). If the reward is actually low, the Q-value is corrected by the usual update: Q(s, a) ← Q(s, a) + α [ r + γ max_a' Q(s', a') - Q(s, a) ].
🔁 So exploration and environment feedback act as a self-correcting mechanism.
So it gets corrected. But in offline RL, you don't have the benefit of controlling your exploration, so you never get that corrective feedback loop and GPI (generalized policy iteration) breaks down.
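A toy illustration of that self-correction, using a single-state, bandit-style setup with made-up numbers, purely to show the mechanism:

```python
import numpy as np

rng = np.random.default_rng(0)
Q = np.array([0.0, 10.0])          # action 1 is wildly overestimated
true_reward = np.array([1.0, 0.0]) # its true reward is actually the worst
lr, eps = 0.5, 0.1

for _ in range(50):
    # epsilon-greedy: mostly picks the (overestimated) greedy action...
    a = rng.integers(2) if rng.random() < eps else int(np.argmax(Q))
    # ...but the environment returns the true (low) reward, and the TD update
    # pulls the inflated estimate back down. Offline, this update for the
    # overestimated action simply never happens, so the inflated value survives.
    Q[a] += lr * (true_reward[a] - Q[a])

print(Q)  # the overestimate for action 1 has decayed toward its true value 0.0
```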
That’s why conservative offline RL methods (like CQL, BCQ, etc.) try to stay close to the dataset rather than trust the raw Q-estimates blindly.
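For reference, a minimal sketch of a CQL-style conservative penalty, assuming a discrete-action Q-network `q_net(states) -> [batch, n_actions]` and a `target_net`; the names and hyperparameters are illustrative, not a definitive implementation:

```python
import torch
import torch.nn.functional as F

def conservative_q_loss(q_net, target_net, batch, gamma=0.99, alpha=1.0):
    # batch comes from the fixed offline dataset; `a` is a LongTensor of action indices.
    s, a, r, s_next = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        target = r + gamma * target_net(s_next).max(dim=1).values

    td_loss = F.mse_loss(q_sa, target)

    # CQL regularizer: push Q down on all actions (logsumexp) while pushing it
    # back up on the actions actually present in the dataset. This is what keeps
    # the estimates from inflating on out-of-distribution actions.
    cql_penalty = (torch.logsumexp(q_net(s), dim=1) - q_sa).mean()

    return td_loss + alpha * cql_penalty
```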
So why do people try offline RL??
Real-world data is expensive or dangerous to collect. Logged data already exists. Think about the internet. So we want to figure out how to learn in an offline setting.