Offline Reinforcement Learning

Offline RL uses previously collected data without any additional data collection.

Offline RL is essentially a tug-of-war between behavioral regularization and value maximization.

Resources / readings:

Techniques:

Notation below is from Conservative Q-Learning (CQL). We have a fixed replay buffer $\mathcal{D}$, collected by some behavior policy $\pi_\beta$, that we are learning from.

Policy evaluation step, via the Bellman expectation backup:

$$\hat{Q}^{k+1} \leftarrow \arg\min_{Q}\ \mathbb{E}_{(s,a,s') \sim \mathcal{D}}\Big[\Big(r(s,a) + \gamma\, \mathbb{E}_{a' \sim \hat{\pi}^{k}(\cdot \mid s')}\big[\hat{Q}^{k}(s', a')\big] - Q(s,a)\Big)^{2}\Big]$$

Policy improvement step:

$$\hat{\pi}^{k+1} \leftarrow \arg\max_{\pi}\ \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi(\cdot \mid s)}\big[\hat{Q}^{k+1}(s, a)\big]$$
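As a rough sketch of what these two steps look like in code, here is a generic actor-critic version in PyTorch. The networks, the toy batch, and the hyperparameters are placeholder assumptions for illustration, not the CQL reference implementation.

```python
# Minimal sketch: one round of policy evaluation + policy improvement on a
# fixed offline batch. `q_net`, `policy`, and the random "dataset" are toys.
import torch
import torch.nn as nn

obs_dim, act_dim, gamma = 4, 2, 0.99

q_net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
q_opt = torch.optim.Adam(q_net.parameters(), lr=3e-4)
pi_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

# A toy batch (s, a, r, s') "sampled" from the offline replay buffer D.
s = torch.randn(32, obs_dim)
a = torch.randn(32, act_dim)
r = torch.randn(32, 1)
s_next = torch.randn(32, obs_dim)

# --- Policy evaluation: regress Q(s,a) onto the Bellman expectation backup ---
# (in practice a separate target network is used for the backup)
with torch.no_grad():
    a_next = policy(s_next)                                   # a' ~ pi(.|s')
    target = r + gamma * q_net(torch.cat([s_next, a_next], dim=-1))
q_loss = ((q_net(torch.cat([s, a], dim=-1)) - target) ** 2).mean()
q_opt.zero_grad(); q_loss.backward(); q_opt.step()

# --- Policy improvement: push pi towards actions that maximize Q on dataset states ---
pi_loss = -q_net(torch.cat([s, policy(s)], dim=-1)).mean()
pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
```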

Notice that we train our Q only on state-action pairs sampled from $\mathcal{D}$ (i.e., from the behavior policy $\pi_\beta$), and then improve a policy $\pi$ that tries to maximize $\hat{Q}(s, a)$. That maximization can get biased towards actions where $Q$ is queried out-of-distribution (OOD) and never grounded by data.

As a result, naive offline RL tends to overestimate the Q-values, especially on those OOD actions.
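A tiny toy illustration of where that bias comes from (the ten-action setup and noise levels below are made up for the example): if the fitted Q is only accurate on actions the dataset covers, maximizing over all actions mostly picks up the extrapolation error on the OOD ones.

```python
# Toy demo: true value of every action is 0, but Q-estimates on actions the
# dataset never covered carry large extrapolation error. The max grabs it.
import numpy as np

rng = np.random.default_rng(0)
true_q = np.zeros(10)                        # 10 actions, all truly worth 0
seen = np.arange(3)                          # only 3 actions appear in D
q_hat = true_q + rng.normal(0.0, 0.1, 10)    # small error where we have data
q_hat[3:] += rng.normal(0.0, 1.0, 7)         # large error on OOD actions

print("max over dataset actions:", q_hat[seen].max())  # close to 0
print("max over all actions    :", q_hat.max())        # typically much larger
```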

But isn't this a problem in online RL as well???

Yes, but in online Q-learning you can actually try those overestimated actions during exploration.

Say the Q-network hallucinates that $Q(s, a)$ is high for some action $a$. The policy tries $a$ (e.g., via ε-greedy). If the reward turns out to be low, the Q-value gets corrected by the TD update:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \big[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\big]$$

πŸ” So exploration and environment feedback act as a self-correcting mechanism.

So the estimate gets corrected over time. But in offline RL, you don't have the benefit of controlling your exploration: OOD mistakes are never tested against the environment, so you lose that corrective loop of generalized policy iteration (GPI).

That’s why conservative offline RL methods (like CQL, BCQ, etc.) try to stay close to the dataset, rather than blindly trusting the raw Q-values.
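For intuition, here is a hedged sketch of the CQL(H)-style conservative term for discrete actions, added on top of the ordinary Bellman error: push Q down on all actions (via a logsumexp) and push it back up on the actions actually in the dataset. The network, batch, and `cql_alpha` weight are assumptions for the example, not the paper's exact training code.

```python
# Sketch of a CQL-style conservative loss for a discrete-action Q-network.
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, n_actions, gamma, cql_alpha = 4, 6, 0.99, 1.0
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
opt = torch.optim.Adam(q_net.parameters(), lr=3e-4)

# Toy batch from the offline dataset D.
s = torch.randn(32, obs_dim)
a = torch.randint(0, n_actions, (32,))
r = torch.randn(32, 1)
s_next = torch.randn(32, obs_dim)

q_all = q_net(s)                                 # Q(s, .)
q_sa = q_all.gather(1, a.unsqueeze(1))           # Q(s, a) on dataset actions
with torch.no_grad():                            # (a target network in practice)
    target = r + gamma * q_net(s_next).max(dim=1, keepdim=True).values

bellman_loss = F.mse_loss(q_sa, target)
# Conservative term: logsumexp_a Q(s, a)  -  E_{a ~ D}[Q(s, a)]
cql_term = (torch.logsumexp(q_all, dim=1, keepdim=True) - q_sa).mean()

loss = bellman_loss + cql_alpha * cql_term
opt.zero_grad(); loss.backward(); opt.step()
```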

So why do people try offline RL??

Real-world data is expensive or dangerous to collect, and logged data already exists at enormous scale (think about the internet). So we want to figure out how to learn good policies in a purely offline setting.

Offline RL vs Off-Policy RL?

Offline RL is always off-policy, by definition. However, off-policy doesn’t necessarily mean offline: off-policy methods can still collect new data. Offline RL is extra hard because it cannot explore at all; it is stuck with a fixed offline dataset (unless you do something like Batch Online RL and add a phase of online interaction).