Proximal Policy Optimization (PPO)
PPO is motivated in much the same way as TRPO: take the largest policy improvement step possible without moving too far from the current policy. Instead of TRPO's second-order methods, though, it uses only first-order methods to constrain the policy update.
Resources
- https://spinningup.openai.com/en/latest/algorithms/ppo.html
- Implementation here
- Lecture 4: TRPO, PPO from Deep RL Foundations, slides here
- https://lilianweng.github.io/posts/2018-04-08-policy-gradient/
- https://openai.com/blog/openai-baselines-ppo/
- https://arxiv.org/pdf/1707.06347.pdf
Other (explained well in the spiderman video)
- https://huggingface.co/blog/deep-rl-ppo#the-clipped-part-of-the-clipped-surrogate-objective-function
- Towards Delivering a Coherent Self-Contained Explanation of Proximal Policy Optimization (explanation paper)
- https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/
Good reasons for using PPO:
- its comparatively high data efficiency
- its ability to cope with various kinds of action spaces
- its robust learning performance
PPO updates its policy via

$$\theta_{k+1} = \arg\max_{\theta} \; \mathbb{E}_{s,a \sim \pi_{\theta_k}} \left[ L(s, a, \theta_k, \theta) \right]$$

and we do (several steps of) gradient ascent to maximize this objective. In the clipped variant,

$$L(s, a, \theta_k, \theta) = \min\!\left( \frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)} A^{\pi_{\theta_k}}(s, a),\; \operatorname{clip}\!\left(\frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)},\, 1-\epsilon,\, 1+\epsilon\right) A^{\pi_{\theta_k}}(s, a) \right)$$

- $\epsilon$ is a (small) hyperparameter which roughly says how far away the new policy is allowed to go from the old.
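A minimal sketch of how this clipped objective is usually computed in code, assuming PyTorch; the function name, argument names, and the eps=0.2 default are illustrative, not taken from the sources above. The probability ratio is formed from stored log-probabilities, clipped to [1-ε, 1+ε], and the elementwise minimum is averaged (and negated so an optimizer can minimize it).

```python
# Sketch only: clipped surrogate loss, assuming log-probs under the old and
# new policies plus advantage estimates are already available as tensors.
import torch

def ppo_clip_loss(new_logp, old_logp, advantages, eps=0.2):
    """Negative clipped surrogate objective (returned as a loss to minimize)."""
    ratio = torch.exp(new_logp - old_logp)            # pi_theta(a|s) / pi_theta_k(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    # Taking the elementwise min keeps the objective pessimistic: the ratio
    # can only help the objective up to roughly 1 +/- eps.
    return -torch.min(unclipped, clipped).mean()
```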
Why couldn't TRPO just do this plain gradient ascent?
Because TRPO enforces its trust region as an explicit KL-divergence constraint, so every update is a constrained optimization problem that needs second-order machinery (conjugate gradient plus a line search), which is computationally expensive.
PPO gets around this by using the clipped surrogate objective above: the clipping builds the trust-region idea into the objective itself, so ordinary first-order gradient ascent is enough (a sketch of such an update loop follows).
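Illustrative only: a plain first-order update loop on the clipped loss, using a toy categorical policy and random batch data standing in for real rollouts collected under $\pi_{\theta_k}$. All names, sizes, and hyperparameters are made up, and `ppo_clip_loss` is the sketch from above.

```python
# Toy example of the first-order PPO update: no constrained optimization,
# just a few epochs of gradient ascent (via Adam) on the clipped loss.
import torch
import torch.nn as nn

obs_dim, n_actions, batch = 8, 4, 256
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

# Pretend these came from a rollout under the old policy pi_theta_k.
obs = torch.randn(batch, obs_dim)
act = torch.randint(n_actions, (batch,))
old_logp = torch.distributions.Categorical(logits=policy(obs)).log_prob(act).detach()
adv = torch.randn(batch)

for _ in range(10):  # several epochs of plain SGD on the same batch
    new_logp = torch.distributions.Categorical(logits=policy(obs)).log_prob(act)
    loss = ppo_clip_loss(new_logp, old_logp, adv, eps=0.2)  # sketch defined above
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The point of the contrast: TRPO would have to solve a constrained problem at each iteration, whereas here the clipping term alone keeps the ratio near 1, so a stock optimizer suffices.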