Advantage-Weighted Regression (AWR)

Saw this from the online batch RL paper.

AWR learns a policy pi(a∣s) by supervised learning on a dataset of (s,a) pairs, but weights each action by its advantage:

Learn a policy that imitates actions with high advantage, and suppresses actions with low advantage.