Implicit Q-Learning
First saw it in this paper: https://arxiv.org/pdf/2505.08078
Original papers:
- https://arxiv.org/pdf/2110.06169 - Offline Reinforcement Learning with Implicit Q-Learning
- https://arxiv.org/pdf/2304.10573 - IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies
- the IDQL paper is the one that discusses implicit policy extraction
In this work, we primarily consider the Implicit Q-Learning (IQL) [20] value objectives given their effectiveness on a range of tasks. IQL aims to fit a value function by estimating expectiles $\tau$ with respect to actions within the support of the data, and then uses the value function to update the Q-function. To do so, it minimizes the following objectives for a parameterized Q-function $Q_\phi$ (with target Q-function $Q_{\phi'}$) and value function $V_\psi$:

$$\mathcal{L}_V(\psi) = \mathbb{E}_{(s,a)\sim\mathcal{D}}\left[L_2^{\tau}\left(Q_{\phi'}(s,a) - V_\psi(s)\right)\right]$$

$$\mathcal{L}_Q(\phi) = \mathbb{E}_{(s,a,r,s')\sim\mathcal{D}}\left[\left(r + \gamma V_\psi(s') - Q_\phi(s,a)\right)^2\right]$$

- where $L_2^{\tau}(u) = |\tau - \mathbb{1}(u < 0)|\,u^2$ is the expectile loss; with $\tau > 0.5$, $V_\psi$ is regressed toward an upper expectile of $Q_{\phi'}$ over dataset actions, approximating a max without querying out-of-distribution actions.
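A minimal PyTorch sketch of these two losses, to make the expectile trick concrete. The function and network names (`expectile_loss`, `value_loss`, `q_loss`, `q_target`, `v_net`, `q_net`) and the batch layout (all networks returning shape-`(B,)` tensors) are my own assumptions, not from the papers; `tau=0.7` and `gamma=0.99` are typical values from the IQL paper's experiments.

```python
import torch

def expectile_loss(diff: torch.Tensor, tau: float = 0.7) -> torch.Tensor:
    # L2_tau(u) = |tau - 1(u < 0)| * u^2
    # tau > 0.5 penalizes underestimates more, so the regression target is
    # pushed toward an upper expectile of Q over the dataset's actions.
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()

def value_loss(q_target, v_net, s, a, tau=0.7):
    # L_V(psi) = E[ L2_tau( Q_phi'(s, a) - V_psi(s) ) ]
    with torch.no_grad():
        q = q_target(s, a)  # target Q-network, frozen for this update
    return expectile_loss(q - v_net(s), tau)

def q_loss(q_net, v_net, s, a, r, s_next, done, gamma=0.99):
    # L_Q(phi) = E[ ( r + gamma * V_psi(s') - Q_phi(s, a) )^2 ]
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * v_net(s_next)
    return (q_net(s, a) - target).pow(2).mean()
```

Note the design choice: the TD target for $Q_\phi$ uses $V_\psi(s')$ rather than $\max_a Q(s', a)$, so no out-of-distribution actions are ever evaluated.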
Original paper, summarized:
- It learns a Q-function from the offline dataset with TD backups against $V_\psi$.
- It derives a value function by "aggregating" the learned Q-function across actions: the expectile regression above, which for $\tau > 0.5$ behaves like a soft maximum over dataset actions.
- It then trains the policy to imitate the best actions in the dataset, using advantage-weighted updates (sketched below).
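The "advantage-weighted updates" are AWR-style: behavioral-cloning log-likelihoods weighted by $\exp(\beta\,A(s,a))$ with $A(s,a) = Q_{\phi'}(s,a) - V_\psi(s)$. A sketch reusing the hypothetical networks above; `policy_net` returning a `torch.distributions` object, the inverse temperature `beta`, and the weight clip at 100 are assumptions (the clip matches a common stabilization in public IQL implementations).

```python
import torch

def policy_loss(policy_net, q_target, v_net, s, a, beta=3.0, max_weight=100.0):
    # Extraction objective: maximize E[ exp(beta * A(s, a)) * log pi(a | s) ]
    # over dataset actions only; no actions are sampled from the policy,
    # which keeps the update within the support of the data.
    with torch.no_grad():
        adv = q_target(s, a) - v_net(s)                  # A(s, a) = Q(s, a) - V(s)
        w = torch.exp(beta * adv).clamp(max=max_weight)  # clipped exponential weights
    log_prob = policy_net(s).log_prob(a)
    if log_prob.dim() > 1:                               # sum over action dims for
        log_prob = log_prob.sum(-1)                      # factorized distributions
    return -(w * log_prob).mean()
```

As $\beta \to 0$ this reduces to plain behavioral cloning; larger $\beta$ concentrates the policy on the highest-advantage actions in the dataset.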