Advantage-Weighted Regression (AWR)

Saw this in the online batch RL paper.

$$\theta^* = \arg\max_\theta \; \mathbb{E}_{(s,a)\sim\mathcal{D}}\left[ e^{\beta\,(Q(s,a) - V(s))} \log \pi_\theta(a \mid s) \right]$$

https://arxiv.org/abs/1910.00177
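
To make the objective above concrete, here is a minimal NumPy sketch of the per-batch loss. It assumes Q(s,a), V(s), and log π_θ(a∣s) are already available as arrays for a batch sampled from D (in practice they would come from learned networks), and that the exponential weights are treated as constants with respect to θ. The names `awr_weights`, `awr_loss`, and the optional `max_weight` cap are hypothetical, introduced only for illustration; the cap is not part of the formula.

```python
import numpy as np


def awr_weights(q, v, beta, max_weight=None):
    """Per-sample weights exp(beta * (Q(s,a) - V(s)))."""
    advantage = np.asarray(q, dtype=float) - np.asarray(v, dtype=float)
    weights = np.exp(beta * advantage)
    if max_weight is not None:
        # Optional cap (an assumption, not in the formula) to limit very large weights.
        weights = np.minimum(weights, max_weight)
    return weights


def awr_loss(log_probs, q, v, beta, max_weight=None):
    """Negative weighted log-likelihood; minimizing it maximizes the objective."""
    weights = awr_weights(q, v, beta, max_weight)
    return -np.mean(weights * np.asarray(log_probs, dtype=float))


# Toy batch of three (s, a) pairs with made-up values.
q = np.array([1.0, 0.5, 2.0])                    # Q(s, a)
v = np.array([1.0, 1.0, 1.0])                    # V(s)
log_probs = np.log(np.array([0.5, 0.25, 0.5]))   # log pi_theta(a | s)

weights = awr_weights(q, v, beta=1.0)
loss = awr_loss(log_probs, q, v, beta=1.0)
```

Actions with positive advantage get weight above 1 and are imitated more strongly; actions with negative advantage are down-weighted. With β = 0 every weight is 1 and the loss reduces to plain behavior cloning on D.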