Advantage-Weighted Regression (AWR)
Saw this in the online batch RL paper.
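AWR learns a policy pi(a|s) by supervised learning on a dataset D of (s, a) pairs, but weights each action's log-likelihood by its exponentiated advantage A(s, a), with a temperature beta > 0:
    maximize over pi:  E_{(s,a) ~ D} [ exp(A(s, a) / beta) * log pi(a|s) ]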
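In effect, the learned policy imitates high-advantage actions strongly and mostly ignores low-advantage ones. The weights are always positive, so low-advantage actions are down-weighted rather than actively pushed away.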
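
A minimal sketch of the weighting idea in NumPy, assuming the usual exponentiated-advantage weight exp(A / beta). The names awr_weights and awr_loss, the temperature beta, and the clipping threshold max_weight are illustrative assumptions, not something these notes specify; clipping is included only as a common stabilization trick.

```python
import numpy as np

def awr_weights(advantages, beta=1.0, max_weight=20.0):
    """Per-sample weights exp(A / beta), clipped at max_weight (assumed values)."""
    adv = np.asarray(advantages, dtype=float)
    return np.minimum(np.exp(adv / beta), max_weight)

def awr_loss(log_probs, advantages, beta=1.0, max_weight=20.0):
    """Advantage-weighted negative log-likelihood: -mean(w * log pi(a|s))."""
    w = awr_weights(advantages, beta, max_weight)
    return -np.mean(w * np.asarray(log_probs, dtype=float))

# Toy batch: log pi(a|s) of three dataset actions and their advantages.
log_probs = np.log([0.5, 0.25, 0.25])
advantages = np.array([1.0, 0.0, -1.0])

weights = awr_weights(advantages)
loss = awr_loss(log_probs, advantages)
```

In a real training loop, log_probs would come from the policy network and the loss would be minimized by gradient descent, with the weights treated as constants. A smaller beta makes the weighting sharper, so the policy concentrates on the best actions in the dataset.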