Advantage-Weighted Regression (AWR)
Saw this in the online batch RL paper.
- So instead of naive BC, we reweight the dataset by each action's advantage
AWR learns a policy pi(a∣s) by supervised learning on a dataset of (s,a) pairs, but weights each action's log-likelihood by its exponentiated advantage, exp(A(s,a)/beta), where beta > 0 is a temperature.
Learn a policy that imitates actions with high advantage, and suppresses actions with low advantage.
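A minimal sketch of the weighting step, assuming the advantages have already been estimated (e.g. return minus a learned value baseline). The names `awr_weights`, `awr_loss`, `beta`, and `max_weight` are made up for illustration, and the weight clipping is a common stabilization trick assumed here, not something stated in these notes.

```python
import math

def awr_weights(advantages, beta=1.0, max_weight=20.0):
    """Per-sample weights exp(A / beta), clipped at max_weight (assumed cap)."""
    return [min(math.exp(a / beta), max_weight) for a in advantages]

def awr_loss(log_probs, advantages, beta=1.0, max_weight=20.0):
    """Advantage-weighted negative log-likelihood, averaged over the batch.

    log_probs[i] is log pi(a_i | s_i) under the current policy.
    With all advantages equal to 0, every weight is 1 and this is plain BC.
    """
    weights = awr_weights(advantages, beta, max_weight)
    total = sum(w * lp for w, lp in zip(weights, log_probs))
    return -total / len(log_probs)

# Toy batch: two actions the policy currently finds equally likely.
log_probs = [math.log(0.5), math.log(0.5)]
advantages = [1.0, -1.0]

weights = awr_weights(advantages)
loss = awr_loss(log_probs, advantages)
bc_loss = awr_loss(log_probs, [0.0, 0.0])  # reduces to naive BC
```

Note that the weights are always positive, so a low-advantage action is down-weighted rather than actively pushed away. A smaller `beta` makes the weighting sharper (closer to imitating only the best actions), a larger `beta` moves it back toward naive BC.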