Advantage-Weighted Regression (AWR)

Saw this in the Batch Online RL paper.

  • So instead of naive BC, we reweight the dataset by each action's advantage

Notice that the AWR objective is very similar to Vanilla Policy Gradient, except that each log-likelihood term is weighted by the exponentiated advantage, exp(A/β) with temperature β, rather than by the advantage itself.

AWR learns a policy pi(a∣s) by supervised learning on a dataset of (s,a) pairs, but weights each pair by its exponentiated advantage:
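
For reference, the weighted objective is usually written as below. This is reconstructed from the standard AWR formulation, not copied from these notes; β > 0 is a temperature hyperparameter, D is the dataset, and A(s,a) is whatever advantage estimate is being used. The second line shows the gradient next to the Vanilla Policy Gradient form (which assumes on-policy samples) to make the similarity explicit.

```latex
% AWR: weighted maximum likelihood over the dataset D
\pi_{\text{new}} = \arg\max_{\pi_\theta} \;
  \mathbb{E}_{(s,a)\sim\mathcal{D}}
  \left[ \exp\!\left(\tfrac{1}{\beta} A(s,a)\right) \log \pi_\theta(a \mid s) \right]

% Gradients side by side: AWR (left) vs. Vanilla Policy Gradient (right)
\mathbb{E}_{(s,a)\sim\mathcal{D}}
  \left[ \exp\!\left(\tfrac{1}{\beta} A(s,a)\right) \nabla_\theta \log \pi_\theta(a \mid s) \right]
\quad\text{vs.}\quad
\mathbb{E}_{(s,a)\sim\pi_\theta}
  \left[ A(s,a) \, \nabla_\theta \log \pi_\theta(a \mid s) \right]
```

As β grows, the weights flatten toward 1 and the objective approaches plain BC; as β shrinks, only the highest-advantage actions matter.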

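In effect: learn a policy that imitates high-advantage actions strongly and mostly ignores low-advantage ones. The weights are always positive, so low-advantage actions are down-weighted rather than actively pushed down.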
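
A minimal NumPy sketch of the weighting, assuming the advantages and the policy's log-probabilities for each (s,a) pair are already computed. The names `awr_weights`, `awr_loss`, `beta`, and `max_weight` are hypothetical, and clipping the weights is a common stabilizing choice that I am assuming here, not something stated in these notes.

```python
import numpy as np


def awr_weights(advantages, beta=1.0, max_weight=20.0):
    """Per-sample weights exp(A / beta), clipped from above for stability."""
    advantages = np.asarray(advantages, dtype=np.float64)
    return np.minimum(np.exp(advantages / beta), max_weight)


def awr_loss(log_probs, advantages, beta=1.0, max_weight=20.0):
    """Negative advantage-weighted log-likelihood, averaged over the batch.

    log_probs[i] is log pi(a_i | s_i) under the current policy.
    The weights are treated as constants (no gradient flows through them).
    """
    log_probs = np.asarray(log_probs, dtype=np.float64)
    weights = awr_weights(advantages, beta=beta, max_weight=max_weight)
    return -np.mean(weights * log_probs)


# Toy batch: three actions with high, zero, and low advantage.
advantages = np.array([1.0, 0.0, -1.0])
log_probs = np.log(np.array([0.5, 0.5, 0.5]))

weights = awr_weights(advantages, beta=1.0)
loss = awr_loss(log_probs, advantages, beta=1.0)
```

With equal log-probabilities, the high-advantage sample contributes the most to the loss and the low-advantage sample the least. Setting all weights to 1 would recover the plain BC loss.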