Trust Region Policy Optimization (TRPO)
TRPO updates the policy by taking the largest step that improves performance while satisfying a constraint on how far the new policy is allowed to move from the old one.
This constraint is expressed in terms of the KL divergence, a measure of how much one probability distribution differs from another.
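To make the constraint concrete, here is a minimal sketch for a single state with a discrete action space. It shrinks a proposed update until the KL divergence between the old and new action distributions falls below a threshold. The names `delta`, `direction`, `shrink`, and `largest_step_within_trust_region` are illustrative assumptions, not part of any TRPO implementation. Full TRPO also averages the KL over visited states, derives the update direction from a natural-gradient computation, and checks that a surrogate objective improves; none of that is shown here.

```python
import math


def softmax(logits):
    """Turn unnormalized logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions; q must be strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def largest_step_within_trust_region(old_logits, direction, delta,
                                     max_step=1.0, shrink=0.5,
                                     max_backtracks=20):
    """Backtrack from max_step until KL(old || new) <= delta.

    Hypothetical helper for illustration: `direction` is a proposed change
    to the logits and `delta` is the KL limit (the trust-region size).
    Returns the accepted step size and the resulting KL divergence.
    """
    old_probs = softmax(old_logits)
    step = max_step
    for _ in range(max_backtracks):
        new_logits = [l + step * d for l, d in zip(old_logits, direction)]
        kl = kl_divergence(old_probs, softmax(new_logits))
        if kl <= delta:
            return step, kl
        step *= shrink  # step too large: shrink it and try again
    return 0.0, 0.0  # no acceptable step found: keep the old policy


if __name__ == "__main__":
    step, kl = largest_step_within_trust_region(
        old_logits=[0.0, 0.0, 0.0],
        direction=[1.0, 0.0, -1.0],
        delta=0.01,
    )
    print(f"accepted step = {step}, KL = {kl:.5f}")
```

A smaller `delta` forces a smaller accepted step, which is the sense in which the KL limit bounds how far the new policy can move from the old one.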
Resources