Trust Region Policy Optimization (TRPO)
TRPO updates policies by taking the largest step possible to improve performance, while satisfying a special constraint on how close the new and old policies are allowed to be.
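This constraint is expressed in terms of the KL divergence, a measure of how far apart two probability distributions are (here, the action distributions of the old and new policies).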
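
To make the KL constraint concrete, here is a minimal sketch that checks whether a candidate policy stays inside the trust region. It assumes discrete actions, policies given as per-state probability tables, and a threshold `delta = 0.01`. These names and values are illustrative, not taken from the paper or any library, and the surrogate-objective and conjugate-gradient parts of TRPO are left out.

```python
import math

def categorical_kl(p_old, p_new):
    """KL(p_old || p_new) for one state's discrete action distribution.

    Assumes every entry of p_new is strictly positive wherever p_old is.
    """
    return sum(po * math.log(po / pn) for po, pn in zip(p_old, p_new) if po > 0.0)

def mean_kl(old_policy, new_policy):
    """Average KL over a batch of states (one row of action probabilities per state)."""
    kls = [categorical_kl(po, pn) for po, pn in zip(old_policy, new_policy)]
    return sum(kls) / len(kls)

def satisfies_trust_region(old_policy, new_policy, delta=0.01):
    """True if the average KL between old and new policy is at most delta (hypothetical threshold)."""
    return mean_kl(old_policy, new_policy) <= delta

# Two states, two actions each (made-up numbers for illustration).
old = [[0.5, 0.5], [0.9, 0.1]]
small_step = [[0.52, 0.48], [0.88, 0.12]]  # stays close to the old policy
large_step = [[0.9, 0.1], [0.5, 0.5]]      # moves far from the old policy

print(satisfies_trust_region(old, small_step))
print(satisfies_trust_region(old, large_step))
```

In a full implementation, a check like this would typically sit inside a backtracking line search: a proposed update is shrunk until the constraint holds and the surrogate objective improves.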
Resources
- https://spinningup.openai.com/en/latest/algorithms/trpo.html
- Original paper: https://proceedings.mlr.press/v37/schulman15.html