Offline Reinforcement Learning with Implicit Q-Learning (IQL)
In this work, we primarily consider the Implicit Q-Learning (IQL) [20] value objectives given its effectiveness on a range of tasks. IQL fits a value function by estimating an expectile τ of the Q-function with respect to actions within the support of the data, and then uses the value function to update the Q-function. To do so, it minimizes the following objectives for a parameterized Q-function Q_\theta (with target Q-function Q_{\widehat{\theta}}) and value function V_\psi:

L_V(\psi) = \mathbb{E}_{(s,a) \sim D} \left[ L^\tau_2 \left( Q_{\widehat{\theta}}(s, a) - V_\psi(s) \right) \right]

L_Q(\theta) = \mathbb{E}_{(s,a,s') \sim D} \left[ \left( r(s, a) + \gamma V_\psi(s') - Q_\theta(s, a) \right)^2 \right]
- where L^\tau_2(u) = |\tau - \mathbb{1}(u < 0)|\, u^2 is the expectile loss.
- The target Q_{\widehat{\theta}} is a delayed copy of Q_\theta, updated via Polyak averaging (a code sketch of these updates follows below).
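For concreteness, a minimal PyTorch sketch of these two losses and the target update is given below. The network interfaces (q_net(states, actions), value_net(states)) and the values of the expectile tau, discount gamma, and Polyak rate rho are illustrative assumptions, not choices prescribed by the text above.

```python
import torch
import torch.nn.functional as F

def expectile_loss(diff, tau=0.7):
    # L^tau_2(u) = |tau - 1(u < 0)| * u^2: asymmetric squared error
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()

def value_loss(q_target_net, value_net, states, actions, tau=0.7):
    # L_V(psi): regress V_psi(s) toward the tau-expectile of Q_theta_hat(s, a)
    with torch.no_grad():
        q = q_target_net(states, actions)
    return expectile_loss(q - value_net(states), tau)

def q_loss(q_net, value_net, states, actions, rewards, next_states, dones, gamma=0.99):
    # L_Q(theta): TD backup through V_psi(s'), avoiding a max over out-of-support actions
    with torch.no_grad():
        target = rewards + gamma * (1.0 - dones) * value_net(next_states)
    return F.mse_loss(q_net(states, actions), target)

@torch.no_grad()
def polyak_update(q_net, q_target_net, rho=0.995):
    # Q_theta_hat <- rho * Q_theta_hat + (1 - rho) * Q_theta
    for p, p_targ in zip(q_net.parameters(), q_target_net.parameters()):
        p_targ.mul_(rho).add_((1.0 - rho) * p)
```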
For the policy, IQL uses advantage-weighted regression (AWR), maximizing the weighted log-likelihood

L_\pi(\phi) = \mathbb{E}_{(s,a) \sim D} \left[ e^{\beta (Q_{\widehat{\theta}}(s,a) - V_\psi(s))} \log \pi_\phi(a|s) \right]

where \beta \ge 0 is an inverse temperature.
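A corresponding sketch of the AWR policy step is shown below; the policy interface exposing log_prob, the value of beta, and the weight clip are assumptions made for illustration.

```python
import torch

def awr_policy_loss(policy, q_target_net, value_net, states, actions,
                    beta=3.0, max_weight=100.0):
    # Weight the log-likelihood of dataset actions by exp(beta * advantage);
    # the exponentiated weights are clipped for numerical stability.
    with torch.no_grad():
        advantage = q_target_net(states, actions) - value_net(states)
        weights = torch.clamp(torch.exp(beta * advantage), max=max_weight)
    log_prob = policy.log_prob(states, actions)  # assumed to return log pi_phi(a|s)
    # Maximizing the weighted log-likelihood is equivalent to minimizing its negation.
    return -(weights * log_prob).mean()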
In summary, IQL:
- learns a Q-function from the offline dataset;
- derives a value function by "aggregating" the learned Q-function across actions via expectile regression;
- then trains the policy to imitate the best actions in the dataset, using advantage-weighted updates.
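Putting the pieces together, a self-contained sketch of one training step is given below. The architectures, dimensions, optimizer settings, batch layout, and the squared-error stand-in for the Gaussian log-likelihood are illustrative assumptions, not the configuration of the cited work.

```python
import copy
import torch
import torch.nn as nn

obs_dim, act_dim = 17, 6                    # illustrative dimensions
tau, beta, gamma, rho = 0.7, 3.0, 0.99, 0.995

q_net = nn.Sequential(nn.Linear(obs_dim + act_dim, 256), nn.ReLU(), nn.Linear(256, 1))
q_target = copy.deepcopy(q_net)
v_net = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, 1))
policy = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, act_dim))

q_opt = torch.optim.Adam(q_net.parameters(), lr=3e-4)
v_opt = torch.optim.Adam(v_net.parameters(), lr=3e-4)
pi_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

def train_step(s, a, r, s2, d):
    """One IQL update on a batch of transitions (r and d are column vectors)."""
    sa = torch.cat([s, a], dim=-1)

    # 1) Value step: expectile regression of V_psi toward Q_theta_hat.
    with torch.no_grad():
        q = q_target(sa)
    diff = q - v_net(s)
    v_loss = (torch.abs(tau - (diff < 0).float()) * diff.pow(2)).mean()
    v_opt.zero_grad(); v_loss.backward(); v_opt.step()

    # 2) Q step: TD backup through V_psi(s').
    with torch.no_grad():
        target = r + gamma * (1.0 - d) * v_net(s2)
    q_loss = (q_net(sa) - target).pow(2).mean()
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()

    # 3) Polyak-averaged target Q-function.
    with torch.no_grad():
        for p, pt in zip(q_net.parameters(), q_target.parameters()):
            pt.mul_(rho).add_((1.0 - rho) * p)

    # 4) Policy step: advantage-weighted regression onto dataset actions
    #    (unit-variance Gaussian log-likelihood up to an additive constant).
    with torch.no_grad():
        w = torch.clamp(torch.exp(beta * (q_target(sa) - v_net(s))), max=100.0)
    log_prob = -0.5 * (policy(s) - a).pow(2).sum(dim=-1, keepdim=True)
    pi_loss = -(w * log_prob).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
```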