Reinforcement Learning

Policy Gradient Methods

A class of Reinforcement Learning methods that is widely used in practice, where we directly optimize the policy rather than learning a Q-function and deriving the policy from it.

  • That’s not fully true, because many policy gradient methods (e.g., actor-critic) still learn a value or Q-function alongside the policy

Classes of policy gradient methods:

  • Monte-Carlo policy gradient (REINFORCE)
  • Actor-critic methods (see the Critic section below)

Make sure to understand this derivation (refer to Steven Gong's notes for the full steps). Using the likelihood ratio (log-derivative) trick:

$$\nabla_\theta J(\theta) = \nabla_\theta \, \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big] = \mathbb{E}_{\tau \sim \pi_\theta}\big[\nabla_\theta \log p_\theta(\tau)\, R(\tau)\big]$$

where the transition dynamics drop out of the gradient, leaving only the policy terms:

$$\nabla_\theta \log p_\theta(\tau) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

We can then update our parameters with stochastic gradient ascent, where $G_t$ is the sampled return from time $t$:

$$\theta \leftarrow \theta + \alpha\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t$$
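As a concrete illustration of this estimator, here is a minimal NumPy sketch for a linear-softmax policy over discrete actions. The feature representation, trajectory format, and learning rate are assumptions made for the example, not part of the derivation above.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_softmax_policy(theta, state_features, action):
    """Score function of a linear-softmax policy:
    grad_theta log pi(a|s) = phi(s, a) - sum_b pi(b|s) phi(s, b).
    theta: (d,); state_features: (num_actions, d), row b = phi(s, b)."""
    probs = softmax(state_features @ theta)
    return state_features[action] - probs @ state_features

def reinforce_update(theta, trajectory, lr=0.01):
    """Vanilla REINFORCE: theta += lr * sum_t grad log pi(a_t|s_t) * G_t,
    where G_t is the (undiscounted, for brevity) return from time t.
    trajectory: list of (state_features, action, reward) tuples."""
    rewards = np.array([r for _, _, r in trajectory], dtype=float)
    returns = np.cumsum(rewards[::-1])[::-1]   # returns-to-go G_t
    grad = np.zeros_like(theta)
    for (phi_s, a, _), g in zip(trajectory, returns):
        grad += grad_log_softmax_policy(theta, phi_s, a) * g
    return theta + lr * grad                   # gradient ascent step
```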

Resources:

“Policy gradient methods work by directly computing an estimate of the gradient of policy parameters in order to maximize the expected return using stochastic gradient descent”.

As mentioned above, the value network helps with reducing variance. Pieter Abbeel calls it the baseline, but it’s essentially the value function (your current estimate of the expected return you will get from this state onward).
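Concretely, subtracting a state-dependent baseline $b(s) \approx V(s)$ keeps the gradient estimate unbiased while lowering its variance, because the score function has zero mean under the policy:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\,\big(G_t - b(s)\big)\big], \qquad \mathbb{E}_{a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\big] = 0$$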

From the Lecture 3 slides:

  • So you have two neural networks which both take in the state $s$. One outputs the policy $\pi_\theta(a \mid s)$ (e.g., the parameters of an action distribution), and the baseline network outputs the value function $V(s)$, as sketched below
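A minimal sketch of that two-network setup, assuming PyTorch and a discrete action space (network sizes and names are illustrative, not from the slides):

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Takes a state, outputs a distribution over actions (the policy pi_theta)."""
    def __init__(self, state_dim, num_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state):
        return torch.distributions.Categorical(logits=self.net(state))

class ValueNet(nn.Module):
    """Takes the same state, outputs the baseline V(s)."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        return self.net(state).squeeze(-1)

# Both networks take in the state; one gives the policy, the other the baseline.
policy, baseline = PolicyNet(state_dim=4, num_actions=2), ValueNet(state_dim=4)
state = torch.randn(4)
dist, value = policy(state), baseline(state)
action = dist.sample()
```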

Instead of doing Value Function Approximation and then deriving a policy from the value function (e.g., by acting greedily with respect to it), we directly parameterize the policy $\pi_\theta(a \mid s) = \mathbb{P}[a \mid s, \theta]$.
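For example, one common concrete parameterization (illustrative, not specified in these notes) is a softmax over linear action preferences with features $\phi(s, a)$:

$$\pi_\theta(a \mid s) = \frac{\exp\big(\phi(s, a)^\top \theta\big)}{\sum_{b} \exp\big(\phi(s, b)^\top \theta\big)}$$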

Advantages:

  • Better convergence properties
  • Effective in high-dimensional or continuous action spaces
  • Can learn stochastic policies

Disadvantages:

  • Typically converge to a local rather than global optimum
  • Evaluating a policy is typically inefficient and high variance

Sometimes, stochastic policies are the best.

For example, in rock-paper-scissors, if your policy was deterministic, your opponent would eventually figure it out, and you would keep losing.

Score Function
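The score function is the gradient of the log-policy, $\nabla_\theta \log \pi_\theta(s, a)$. It comes from the likelihood ratio trick:

$$\nabla_\theta \pi_\theta(s, a) = \pi_\theta(s, a)\,\frac{\nabla_\theta \pi_\theta(s, a)}{\pi_\theta(s, a)} = \pi_\theta(s, a)\,\nabla_\theta \log \pi_\theta(s, a)$$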

Policy Gradient Theorem

For any differentiable policy $\pi_\theta(s, a)$, and for any of the policy objective functions $J = J_1$, $J_{avR}$, or $\frac{1}{1-\gamma} J_{avV}$, the policy gradient is

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(s, a)\, Q^{\pi_\theta}(s, a)\big]$$

Critic

Monte-Carlo policy gradient still has high variance, so we use a critic to estimate the action-value function, $Q_w(s, a) \approx Q^{\pi_\theta}(s, a)$. Actor-critic algorithms maintain two sets of parameters:

  • Critic: updates the action-value function parameters $w$
  • Actor: updates the policy parameters $\theta$, in the direction suggested by the critic (see the sketch below)
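A minimal sketch of one such actor-critic update, assuming a QAC-style loop with shared linear features, a TD(0) critic, and a softmax actor (all names and step sizes here are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def qac_step(theta, w, phi, num_actions, s, a, r, s_next, a_next,
             gamma=0.99, alpha_w=0.05, alpha_theta=0.01):
    """One action-value actor-critic step with linear features phi(s, a).

    Critic: TD(0) update of w, so that Q_w(s, a) = w . phi(s, a) tracks Q^{pi_theta}.
    Actor:  step theta along grad log pi_theta(a|s) * Q_w(s, a),
            i.e. the direction suggested by the critic.
    """
    # Critic: TD(0) error and update of the action-value parameters w
    td_error = r + gamma * (w @ phi(s_next, a_next)) - w @ phi(s, a)
    w = w + alpha_w * td_error * phi(s, a)

    # Actor: score function of a linear-softmax policy,
    # grad log pi(a|s) = phi(s, a) - sum_b pi(b|s) phi(s, b)
    feats = np.stack([phi(s, b) for b in range(num_actions)])
    probs = softmax(feats @ theta)
    score = phi(s, a) - probs @ feats
    theta = theta + alpha_theta * score * (w @ phi(s, a))

    return theta, w
```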