Policy Gradient Methods
Class of Reinforcement Learning methods that is widely used in practice.
Instead of working with Value Functions, we directly work with the Policy.
Resources:
- Lecture 3: Policy Gradient and Advantage Estimation from Deep RL Foundation Series, slides here
I have a lot of trouble with the math derivation, but I think I understand the big picture. There are two networks:
- Policy Network: mapping states to actions
- network trained to learn the optimal policy by adjusting its weights to maximize the expected return
- Value Network / Critic Network: takes the current state as input and outputs a value estimate (the Value Function); a sketch of both networks is below
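A minimal sketch of what these two networks might look like, assuming a discrete action space and PyTorch; the architecture, layer sizes, and names here are my own, not from the lecture:

```python
# Sketch only (my own, not from the lecture): a policy network and a value network
# for a discrete action space. Sizes and names are arbitrary.
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Maps a state to a probability distribution over actions."""
    def __init__(self, state_dim, num_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state):
        logits = self.net(state)
        return torch.distributions.Categorical(logits=logits)

class ValueNetwork(nn.Module):
    """Maps a state to a scalar value estimate V(s)."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        return self.net(state).squeeze(-1)
```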
Then why do you need a value network? Doesn't that make it no longer a policy gradient method?
Policy gradient methods directly optimize the policy (action strategy). Adding a value network, which estimates expected returns, isn’t required but improves learning in two main ways:
- Variance Reduction: It helps to decrease the variability in policy updates, making learning more stable.
- Better Decision Making: It allows more informed decisions, balancing exploration and exploitation better.
Incorporating a value network doesn’t stop a method from being a policy gradient method. It’s still optimizing the policy directly, just with added benefits.
“Policy gradient methods work by directly computing an estimate of the gradient of policy parameters in order to maximize the expected return using stochastic gradient descent”.
As talked about above, the value network helps with reducing variance. Pieter Abbeel calls it the baseline, but it’s essentially the value function (your current estimate of how much expected reward you will get at this state).
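Concretely, subtracting a baseline $b(s)$ from the return leaves the gradient estimate unbiased (the baseline doesn't depend on the action) but reduces its variance; choosing $b(s) = V^{\pi_\theta}(s)$ turns the weighting term into the advantage:

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\,\big( Q^{\pi_\theta}(s,a) - b(s) \big) \right],
\qquad
A^{\pi_\theta}(s,a) = Q^{\pi_\theta}(s,a) - V^{\pi_\theta}(s)
```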
From the Lecture 3 slides:
- So you have two neural networks which both take in the state $s$: one outputs the policy $\pi_\theta(a \mid s)$ (with parameters $\theta$), and the baseline outputs the value function $V(s)$
Policy gradient methods are great to work with because we don't need an explicit model of the world. See Lilian Weng's post at https://lilianweng.github.io/posts/2018-04-08-policy-gradient/, which also discusses why they work well for continuous action spaces.
- Finite Difference Policy Gradient
- Monte-Carlo Policy Gradient
- Actor-Critic Policy Gradient
Instead of approximating the value function and then deriving a policy from it, we directly parameterize the policy.
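For example (my own illustration, not from the notes above): with discrete actions a common parameterization is a softmax over a learned score, and with continuous actions a Gaussian whose mean is a learned function of the state:

```latex
% Softmax policy (discrete actions), with features \phi(s,a) and weights \theta:
\pi_\theta(a \mid s) = \frac{\exp\big(\phi(s,a)^\top \theta\big)}{\sum_{a'} \exp\big(\phi(s,a')^\top \theta\big)}

% Gaussian policy (continuous actions), with learned mean \mu_\theta(s):
\pi_\theta(a \mid s) = \mathcal{N}\big(a \;;\; \mu_\theta(s),\, \sigma^2\big)
```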
Advantages:
- Better convergence properties
- Effective in high-dimensional or continuous action spaces
- Can learn stochastic policies
Disadvantages:
- Typically converge to a local rather than global optimum
- Evaluating a policy is typically inefficient and high variance
Sometimes, stochastic policies are the best.
For example, in rock-paper-scissors, if your policy was deterministic, your opponent would eventually figure it out, and you would keep losing.
Score Function
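The idea is the likelihood-ratio trick: rewrite the gradient of the policy in terms of the gradient of its log. The term $\nabla_\theta \log \pi_\theta(s,a)$ is called the score function:

```latex
\nabla_\theta \pi_\theta(s,a)
  = \pi_\theta(s,a)\,\frac{\nabla_\theta \pi_\theta(s,a)}{\pi_\theta(s,a)}
  = \pi_\theta(s,a)\,\nabla_\theta \log \pi_\theta(s,a)
```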
Policy Gradient Theorem
For any differentiable policy $\pi_\theta(s,a)$, and for any of the policy objective functions (start-state value $J_1$, average reward $J_{avR}$, or average value $\frac{1}{1-\gamma} J_{avV}$), the policy gradient is
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(s,a)\, Q^{\pi_\theta}(s,a) \right]$$
- Update the policy parameters $\theta$ by stochastic gradient ascent in the direction of this gradient
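A rough sketch of one Monte-Carlo policy gradient (REINFORCE) update, which uses the sampled return $G_t$ as an unbiased estimate of $Q^{\pi_\theta}(s_t, a_t)$. It assumes the PolicyNetwork sketched earlier and a single finished episode; hyperparameters and names are my own:

```python
# REINFORCE-style update sketch (my own). Assumes `policy` is the PolicyNetwork
# from the earlier sketch, and states/actions are lists of tensors from one episode.
import torch

def reinforce_update(policy, optimizer, states, actions, rewards, gamma=0.99):
    # Discounted returns G_t, computed back to front.
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns, dtype=torch.float32)

    # Maximize sum_t log pi(a_t | s_t) * G_t, so minimize the negative.
    dist = policy(torch.stack(states))
    log_probs = dist.log_prob(torch.stack(actions))
    loss = -(log_probs * returns).sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```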
Actor-Critic Policy Gradient
Monte-Carlo policy gradient still has high variance, so we use a critic to estimate the action-value function, $Q_w(s,a) \approx Q^{\pi_\theta}(s,a)$. Actor-critic algorithms maintain two sets of parameters (see the sketch after this list):
- Critic: updates the action-value function parameters $w$
- Actor: updates the policy parameters $\theta$, in the direction suggested by the critic
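A rough sketch of one actor-critic step. Note this is a common TD(0)/advantage variant (using the TD error as the critic's signal) rather than the exact $Q_w(s,a)$ form in the slide; it assumes the PolicyNetwork and ValueNetwork sketched earlier:

```python
# One actor-critic step sketch (my own, a TD(0) variant). Assumes PolicyNetwork
# and ValueNetwork from the earlier sketch, with state/action as tensors.
import torch

def actor_critic_step(policy, value, policy_opt, value_opt,
                      state, action, reward, next_state, done, gamma=0.99):
    # Critic: TD(0) target and error.
    v = value(state)
    with torch.no_grad():
        v_next = torch.zeros(()) if done else value(next_state)
        td_target = reward + gamma * v_next
    td_error = td_target - v

    # Critic update: move V(s) toward the TD target.
    value_loss = td_error.pow(2)
    value_opt.zero_grad()
    value_loss.backward()
    value_opt.step()

    # Actor update: push the policy in the direction suggested by the critic.
    log_prob = policy(state).log_prob(action)
    policy_loss = -log_prob * td_error.detach()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()
```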