Policy Gradient Methods

Proximal Policy Optimization (PPO)

PPO is motivated in much the same way as TRPO, except that instead of second-order methods it uses first-order methods to constrain the policy update.

Resources

Other (explained well in the spiderman video)

Good reasons for using PPO:

  • its comparatively high data efficiency
  • its ability to cope with various kinds of action spaces
  • its robust learning performance

PPO updates its policies via

$$\theta_{k+1} = \arg\max_{\theta}\; \mathbb{E}_{s,a \sim \pi_{\theta_k}}\!\left[\min\!\left(\frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)}\, A^{\pi_{\theta_k}}(s,a),\; \operatorname{clip}\!\left(\frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)},\, 1-\epsilon,\, 1+\epsilon\right) A^{\pi_{\theta_k}}(s,a)\right)\right],$$

and we do gradient ascent to maximize this objective.

  • $\epsilon$ is a (small) hyperparameter which roughly says how far away the new policy is allowed to go from the old.
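
For intuition, a small worked example with made-up numbers: take $\epsilon = 0.2$ and a state-action pair with positive advantage $A > 0$. If gradient ascent pushes the probability ratio $r = \pi_\theta(a \mid s) / \pi_{\theta_k}(a \mid s)$ up to $1.5$, the clip term caps it at $1.2$, so the objective contributes $\min(1.5A,\, 1.2A) = 1.2A$; pushing the ratio past $1+\epsilon$ earns nothing extra. Symmetrically, when $A < 0$ the objective stops improving once the ratio drops below $1-\epsilon$, so there is no incentive to move the new policy far from the old one in either direction.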

Why couldn't TRPO just do this gradient ascent directly?

Because TRPO enforces a hard KL-divergence constraint on the update, so every step requires solving a constrained optimization problem (using second-order information), which is quite computationally expensive.

PPO gets around this by using the clipped surrogate objective shown above: the clipping removes the incentive to move the new policy far from the old one, so plain first-order gradient ascent is enough.
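
To make the computation concrete, here is a minimal NumPy sketch of the clipped surrogate objective (my own illustration, not taken from any particular library); the names log_probs_new, log_probs_old, advantages and the default epsilon = 0.2 are assumptions for the example.

```python
import numpy as np

def clipped_surrogate_objective(log_probs_new, log_probs_old, advantages, epsilon=0.2):
    """PPO-Clip surrogate: mean of min(r * A, clip(r, 1-eps, 1+eps) * A).

    log_probs_new: log pi_theta(a|s) under the policy being optimized (assumed input)
    log_probs_old: log pi_theta_k(a|s) under the policy that collected the data
    advantages:    advantage estimates A^{pi_theta_k}(s, a)
    """
    # Probability ratio r = pi_theta(a|s) / pi_theta_k(a|s), computed in log space.
    ratio = np.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Taking the element-wise min gives a pessimistic bound, so there is
    # no incentive to push the new policy far from the old one.
    return np.mean(np.minimum(unclipped, clipped))

# Toy usage with made-up numbers:
lp_new = np.log(np.array([0.30, 0.10, 0.55]))
lp_old = np.log(np.array([0.25, 0.20, 0.50]))
adv = np.array([1.0, -0.5, 2.0])
print(clipped_surrogate_objective(lp_new, lp_old, adv))
```

In practice the negative of this quantity is used as a loss and minimized with several minibatch gradient steps per batch of collected data.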