🛠️ Steven Gong


Advantage-Weighted Regression (AWR)

Feb 11, 2026, 1 min read


Saw this in the Batch Online RL paper.

$θ^{*} = \arg\max_{θ} \mathbb{E}_{(s, a) \sim D} \left[ e^{β (Q(s, a) - V(s))} \log π_{θ}(a ∣ s) \right]$

So instead of naive behavior cloning (BC), we reweight each dataset sample by its exponentiated advantage $e^{β (Q(s, a) - V(s))}$.
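To make the reweighting concrete, here is a minimal sketch of the per-sample weights. The function name `awr_weights`, the toy Q and V values, and the `max_weight` clip are assumptions for illustration and are not taken from the paper; clipping is only a common guard against very large exponentials.

```python
import math

def awr_weights(q_values, v_values, beta=1.0, max_weight=20.0):
    """Per-sample weights exp(beta * (Q - V)), clipped at max_weight (assumed guard)."""
    weights = []
    for q, v in zip(q_values, v_values):
        advantage = q - v
        weights.append(min(math.exp(beta * advantage), max_weight))
    return weights

# Three dataset actions: better than, equal to, and worse than the baseline V(s).
q_values = [2.0, 1.0, 0.0]
v_values = [1.0, 1.0, 1.0]

bc_weights = [1.0] * len(q_values)                 # naive BC: every sample counts equally
awr = awr_weights(q_values, v_values, beta=1.0)    # AWR: weight grows with the advantage
```

With zero advantage the weight is exactly 1, so a dataset whose actions all match the baseline reduces to plain BC. Larger `beta` sharpens the contrast between good and bad actions.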

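Notice that this objective is very similar to Vanilla Policy Gradient, except that the log-likelihood is weighted by the exponentiated advantage instead of the advantage itself, and the expectation is taken over a fixed dataset $D$ rather than over on-policy samples.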
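Writing the two gradients side by side makes the comparison explicit. This is a sketch: $A^{π_{θ}}$ (the advantage under the current policy) is notation introduced here, not used elsewhere in this note.

Vanilla Policy Gradient:

$\nabla_{θ} J(θ) = \mathbb{E}_{(s, a) \sim π_{θ}} \left[ A^{π_{θ}}(s, a) \, \nabla_{θ} \log π_{θ}(a ∣ s) \right]$

AWR:

$\nabla_{θ} L(θ) = \mathbb{E}_{(s, a) \sim D} \left[ e^{β (Q(s, a) - V(s))} \, \nabla_{θ} \log π_{θ}(a ∣ s) \right]$

Both have the form "weight times score function". The AWR weight is always positive, whereas the advantage in VPG can be negative.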

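AWR learns a policy $π_{θ}(a ∣ s)$ by supervised learning on a dataset of $(s, a)$ pairs, weighting each pair's log-likelihood by its exponentiated advantage.

The result is a policy that imitates actions with high advantage and down-weights actions with low advantage. Because the weights are always positive, poor actions are mostly ignored rather than explicitly pushed away.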
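As a rough illustration of the weighted supervised update, the sketch below runs gradient ascent on the weighted log-likelihood for a single state with a tabular softmax policy. Everything here is an assumption for illustration: the helper names `softmax` and `awr_step`, the toy advantages, the learning rate, and the `max_weight` clip. A real implementation would use a neural network policy and learned $Q$ and $V$ estimates.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def awr_step(logits, actions, advantages, beta=1.0, lr=0.5, max_weight=20.0):
    """One gradient-ascent step on the mean of w * log pi(a) for a single state."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, adv in zip(actions, advantages):
        w = min(math.exp(beta * adv), max_weight)  # exponentiated advantage weight
        # d/d logit_k of log softmax(logits)[a] = 1[k == a] - probs[k]
        for k in range(len(logits)):
            grad[k] += w * ((1.0 if k == a else 0.0) - probs[k])
    n = len(actions)
    return [l + lr * g / n for l, g in zip(logits, grad)]

# Toy dataset: one sample per action, with positive, zero, and negative advantage.
logits = [0.0, 0.0, 0.0]
actions = [0, 1, 2]
advantages = [1.0, 0.0, -1.0]

for _ in range(200):
    logits = awr_step(logits, actions, advantages)

policy = softmax(logits)  # probability mass shifts toward the high-advantage action
```

In this single-state toy problem the learned action probabilities end up proportional to the total weight each action received in the dataset, so the low-advantage action keeps a small but nonzero probability.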

Related

Policy Extraction


