Policy Gradient Methods

Trust Region Policy Optimization (TRPO)

TRPO updates policies by taking the largest step possible to improve performance, while satisfying a special constraint on how close the new and old policies are allowed to be.

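This constraint is expressed in terms of the KL divergence, a measure of how far apart two probability distributions are (it behaves somewhat like a distance, but it is not symmetric).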
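To make the constraint concrete, here is a minimal sketch of how the closeness check might look for a policy over a small discrete action set. Everything in it is an assumption for illustration: the helper names (categorical_kl, mean_kl, within_trust_region), the representation of a policy as a list of per-state action probabilities, and the threshold delta = 0.01. The divergence is taken here as KL(old || new) averaged over sampled states; some presentations write the arguments in the other order. This only checks the constraint and is not a full TRPO update.

```python
import math

def categorical_kl(p, q):
    """KL(p || q) for two discrete distributions given as probability lists.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute zero.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def mean_kl(old_probs, new_probs):
    """Average KL(old || new) over a batch of states."""
    kls = [categorical_kl(p, q) for p, q in zip(old_probs, new_probs)]
    return sum(kls) / len(kls)

def within_trust_region(old_probs, new_probs, delta=0.01):
    """True if the new policy stays within the (illustrative) KL limit delta."""
    return mean_kl(old_probs, new_probs) <= delta

# Hypothetical action probabilities for two states and two actions.
old_policy = [[0.5, 0.5], [0.9, 0.1]]
small_step = [[0.52, 0.48], [0.89, 0.11]]  # close to the old policy
large_step = [[0.9, 0.1], [0.1, 0.9]]      # far from the old policy

print(within_trust_region(old_policy, small_step))
print(within_trust_region(old_policy, large_step))
```

In this sketch a candidate policy that barely shifts the action probabilities passes the check, while one that moves most of the probability mass to different actions is rejected, which is the role the constraint plays in limiting the step size.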

Resources