Policy Gradient Methods

Trust Region Policy Optimization (TRPO)

TRPO updates policies by taking the largest step possible to improve performance, while satisfying a special constraint on how close the new and old policies are allowed to be.

Resources

In Vanilla Policy Gradient, we just do a gradient update of the weights $\theta$. If you think about the parameter landscape, the policy $\pi_\theta$ can change drastically between steps: even a small step in parameter space can correspond to a large change in the policy's behavior.
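To make that concrete, here is a minimal sketch of the kind of unconstrained update VPG performs (a toy illustration, not the actual Spinning Up code; `policy_net`, `optimizer`, and the batch tensors are placeholder names):

```python
import torch

def vpg_update(policy_net, optimizer, obs, actions, advantages):
    # placeholder: assume policy_net(obs) returns a torch.distributions object
    dist = policy_net(obs)
    log_p = dist.log_prob(actions)

    # classic policy gradient loss: -E[ log pi_theta(a|s) * A(s,a) ]
    loss = -(log_p * advantages).mean()

    # one unconstrained gradient step on theta -- nothing here limits how far
    # pi_theta moves in policy space, only in parameter space (via the lr)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```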

TRPO nicely avoids this kind of collapse, and tends to quickly and monotonically improve performance, by adding a KL divergence constraint that forces the new policy $\pi_\theta$ to stay close to the old policy $\pi_{\theta_k}$.

Looking at the Spinning Up VPG implementation, I was expecting them to just use log_p - old_log_p and then scale it by some coefficient alpha as a regularization term. However, it doesn't seem like they do that.
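For reference, this is roughly what I had in mind (my own sketch and naming, not anything taken from the Spinning Up code): the difference of log-probs gives the importance ratio, and its negated mean is a common sample-based estimate of the KL between the old and new policies.

```python
import torch

def ratio_and_approx_kl(log_p, old_log_p):
    # pi_theta(a|s) / pi_theta_k(a|s), recovered from the log-probs
    ratio = torch.exp(log_p - old_log_p)
    # sample-based estimate of the KL between the old and new policies
    approx_kl = (old_log_p - log_p).mean()
    return ratio, approx_kl
```

A penalty-style method would add something like alpha * approx_kl to the loss; TRPO instead treats the KL as a hard constraint on the update.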

The goal is to find the parameters $\theta$ that maximize the surrogate advantage while satisfying the KL divergence constraint.

The surrogate advantage:

$$\mathcal{L}(\theta_k, \theta) = \underset{s, a \sim \pi_{\theta_k}}{\mathbb{E}} \left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)} \, A^{\pi_{\theta_k}}(s, a) \right]$$

And the KL divergence constraint:

$$\bar{D}_{KL}(\theta \,\|\, \theta_k) = \underset{s \sim \pi_{\theta_k}}{\mathbb{E}} \left[ D_{KL}\big( \pi_\theta(\cdot \mid s) \,\|\, \pi_{\theta_k}(\cdot \mid s) \big) \right] \le \delta
$$
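Putting the two pieces together, here is a small sketch of how both quantities could be computed from a batch of samples (illustrative code under my own naming, assuming PyTorch distribution objects for the two policies and a hypothetical KL radius `delta`):

```python
import torch

def surrogate_and_kl(new_dist, old_dist, actions, advantages, delta=0.01):
    # surrogate advantage: E[ (pi_theta / pi_theta_k) * A ]
    ratio = torch.exp(new_dist.log_prob(actions) - old_dist.log_prob(actions))
    surrogate = (ratio * advantages).mean()

    # average KL divergence between the new and old policies over the batch
    mean_kl = torch.distributions.kl_divergence(new_dist, old_dist).mean()

    # TRPO only accepts a step if the average KL stays within the radius delta
    return surrogate, mean_kl, mean_kl <= delta
```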