Policy Gradient Methods

A class of Reinforcement Learning methods that is widely used in practice.

Instead of working with Value Functions, we directly work with the Policy.

Resources:

I have a lot of trouble with the math derivation, but I think I understand the big picture. There are 2 networks (sketched in code after this list):

  1. Policy Network: maps states to actions
    • trained to learn the optimal policy by adjusting its weights to maximize the expected return
  2. Value Network / critic network: takes the current state as input and outputs a value estimate (Value Function)
    • used in certain policy gradient algorithms, such as A2C and PPO; it estimates the expected return of being in a particular state
    • the value estimate is used to calculate advantages (or serves as a baseline) for policy updates
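
Here is a minimal sketch of the two networks, assuming PyTorch, a discrete action space, and made-up sizes (4-dimensional state, 2 actions); none of these specifics come from the notes above:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

STATE_DIM, N_ACTIONS = 4, 2  # illustrative sizes, not from any particular environment

# 1. Policy network: state -> distribution over actions
policy_net = nn.Sequential(
    nn.Linear(STATE_DIM, 64), nn.Tanh(),
    nn.Linear(64, N_ACTIONS),  # outputs action logits
)

# 2. Value network ("critic"): state -> scalar estimate of the value function V(s)
value_net = nn.Sequential(
    nn.Linear(STATE_DIM, 64), nn.Tanh(),
    nn.Linear(64, 1),
)

state = torch.randn(STATE_DIM)                 # stand-in for an observed state
dist = Categorical(logits=policy_net(state))   # pi_theta(. | s)
action = dist.sample()                         # act by sampling from the policy
value = value_net(state)                       # critic's estimate of expected return
```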

Then why do you need a value network? Doesn’t that make it not a policy gradient method?

Policy gradient methods directly optimize the policy (action strategy). Adding a value network, which estimates expected returns, isn’t required but improves learning in two main ways:

  1. Variance Reduction: It helps to decrease the variability in policy updates, making learning more stable.
  2. Better Decision Making: It allows more informed decisions, balancing exploration and exploitation better.

Incorporating a value network doesn’t stop a method from being a policy gradient method. It’s still optimizing the policy directly, just with added benefits.
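
As a rough sketch of how the value network feeds into a policy update (point 1 above): the critic’s estimate is subtracted from the sampled return to form an advantage, and actions that did better than expected are made more likely. The numbers and variable names here are made up for illustration:

```python
import torch

# Hypothetical batch of three timesteps collected by running the current policy
log_probs = torch.tensor([-0.3, -1.2, -0.7], requires_grad=True)  # log pi_theta(a_t | s_t)
returns   = torch.tensor([5.0, 2.0, 3.5])                         # sampled returns G_t
values    = torch.tensor([4.0, 2.5, 3.0])                         # critic's estimates V(s_t)

advantages = returns - values                    # how much better than expected each action did
policy_loss = -(log_probs * advantages).mean()   # minimizing this ascends the expected return
value_loss = (returns - values).pow(2).mean()    # regression loss for training the critic
```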

Policy gradient methods work by directly computing an estimate of the gradient of the expected return with respect to the policy parameters, then improving the policy by stochastic gradient ascent on that return (equivalently, gradient descent on the negative objective).

As talked about above, the value network helps with reducing variance. Pieter Abbeel calls it the baseline, but it’s essentially the value function (your current estimate of how much reward you expect to collect from this state onward).
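
Why subtracting a baseline is “free”: for any state-dependent baseline $b(s)$, such as $V^{\pi_\theta}(s)$, the extra term has zero expectation, so it changes the variance of the gradient estimate but not its mean. This is a standard identity (written here for a discrete action space), not something specific to the lecture slides:

$$
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\big]
= b(s) \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)
= b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s)
= b(s)\, \nabla_\theta 1
= 0
$$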

From the Lecture 3 slides:

  • So you have two neural networks, both of which take in the state $s$. One parameterizes the policy $\pi_\theta(a \mid s)$, and the baseline network generates the value function $V(s)$

These are great to work with because we don’t need an explicit model of the world. See Lilian Weng’s post https://lilianweng.github.io/posts/2018-04-08-policy-gradient/, which talks about how they’re also well-suited to continuous action spaces.

Instead of doing Value Function Approximation and then deriving a policy from the value function, we directly parameterize the policy.
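
One common way to directly parameterize a policy over a continuous action space is a Gaussian whose mean comes from a network, with a learned log-standard-deviation. A minimal sketch, assuming PyTorch and illustrative sizes (3-dimensional state, 1-dimensional action):

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

STATE_DIM, ACTION_DIM = 3, 1  # illustrative sizes

mean_net = nn.Linear(STATE_DIM, ACTION_DIM)       # state -> mean of the action distribution
log_std = nn.Parameter(torch.zeros(ACTION_DIM))   # learned, state-independent log std

state = torch.randn(STATE_DIM)
dist = Normal(mean_net(state), log_std.exp())     # pi_theta(a | s)
action = dist.sample()
log_prob = dist.log_prob(action).sum()            # plugs into the policy gradient estimate
```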

Advantages:

  • Better convergence properties
  • Effective in high-dimensional or continuous action spaces
  • Can learn stochastic policies

Disadvantages:

  • Typically converge to a local rather than global optimum
  • Evaluating a policy is typically inefficient and high variance

Sometimes, stochastic policies are the best.

For example, in rock-paper-scissors, if your policy was deterministic, your opponent would eventually figure it out, and you would keep losing.

Score Function

The score function is the gradient of the log-policy, $\nabla_\theta \log \pi_\theta(s, a)$: how the log-probability of taking action $a$ in state $s$ changes as you move the policy parameters.
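
The likelihood-ratio trick is what makes the score function useful: the gradient of the policy can be written as the policy times its score, which turns the policy gradient into an expectation under the policy that we can estimate by sampling:

$$
\nabla_\theta \pi_\theta(s, a)
= \pi_\theta(s, a)\, \frac{\nabla_\theta \pi_\theta(s, a)}{\pi_\theta(s, a)}
= \pi_\theta(s, a)\, \nabla_\theta \log \pi_\theta(s, a)
$$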

Policy Gradient Theorem

For any differentiable policy $\pi_\theta(s, a)$, for any of the policy objective functions $J = J_1$, $J_{avR}$, or $\frac{1}{1-\gamma} J_{avV}$, the policy gradient is

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(s, a)\, Q^{\pi_\theta}(s, a)\big]
$$

Monte-Carlo Policy Gradient

  • Update parameters by stochastic gradient ascent, using the return $v_t$ as an unbiased sample of $Q^{\pi_\theta}(s_t, a_t)$: $\Delta\theta_t = \alpha\, \nabla_\theta \log \pi_\theta(s_t, a_t)\, v_t$ (this is REINFORCE; a sketch follows below)
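
A minimal REINFORCE sketch, assuming Gymnasium’s CartPole-v1 and PyTorch; the architecture, learning rate, and episode count are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical
import gymnasium as gym

env = gym.make("CartPole-v1")
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))  # 4 obs dims, 2 actions
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99

for episode in range(200):
    obs, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        dist = Categorical(logits=policy(torch.as_tensor(obs, dtype=torch.float32)))
        action = dist.sample()
        obs, reward, terminated, truncated, _ = env.step(action.item())
        done = terminated or truncated
        log_probs.append(dist.log_prob(action))
        rewards.append(float(reward))

    # Monte-Carlo return for each step: v_t = r_{t+1} + gamma * v_{t+1}
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    returns = torch.tensor(returns, dtype=torch.float32)

    # Stochastic gradient ascent on E[v_t * log pi_theta(a_t | s_t)],
    # implemented as descent on the negative
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```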

Critic

Monte-Carlo policy gradient still has high variance, so we use a critic to estimate the action-value function, $Q_w(s, a) \approx Q^{\pi_\theta}(s, a)$. Actor-critic algorithms maintain two sets of parameters (a one-step update is sketched in code below):

  • Critic: updates the action-value function parameters $w$
  • Actor: updates the policy parameters $\theta$, in the direction suggested by the critic
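
A sketch of one actor-critic update on a single transition $(s, a, r, s', a')$, assuming PyTorch, a discrete action space, and a TD(0) target for the critic; the sizes and learning rates are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

STATE_DIM, N_ACTIONS, GAMMA = 4, 2, 0.99  # illustrative sizes

actor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(), nn.Linear(64, N_ACTIONS))
critic = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(), nn.Linear(64, N_ACTIONS))  # Q_w(s, .)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def update(s, a, r, s_next, a_next):
    """One update from a transition: s, s_next are (STATE_DIM,) float tensors; a, a_next ints; r a float."""
    q = critic(s)[a]                                    # Q_w(s, a)
    with torch.no_grad():
        td_target = r + GAMMA * critic(s_next)[a_next]  # TD(0) target

    # Critic: move Q_w(s, a) toward the TD target
    critic_loss = (td_target - q).pow(2)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: update theta in the direction suggested by the critic's estimate
    log_prob = Categorical(logits=actor(s)).log_prob(torch.tensor(a))
    actor_loss = -log_prob * q.detach()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

# Example call with dummy data
update(torch.randn(STATE_DIM), a=0, r=1.0, s_next=torch.randn(STATE_DIM), a_next=1)
```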