Policy Gradient Methods

A class of Reinforcement Learning methods that is widely used in practice.

Instead of working with Value Functions, we directly work with the Policy.

Resources:

I have a lot of trouble with the math derivation, but I think I understand the big picture. There are 2 networks (sketched in code after this list):

  1. Policy Network: maps states to actions
    • trained to learn the optimal policy by adjusting its weights to maximize the expected return
  2. Value Network / critic network: takes the current state as input and outputs a value estimate (Value Function)
    • used in certain policy gradient algorithms, such as A2C and PPO; it estimates the expected return of being in a particular state
    • the value estimate is used to calculate advantages (or serves as a baseline) for policy updates
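
Here is a minimal sketch of the two networks, assuming PyTorch, a discrete action space, and made-up sizes (4-dimensional state, 2 actions); none of these specifics come from the notes above:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

STATE_DIM, N_ACTIONS = 4, 2  # illustrative sizes, not from any particular environment

# 1. Policy network: state -> distribution over actions
policy_net = nn.Sequential(
    nn.Linear(STATE_DIM, 64), nn.Tanh(),
    nn.Linear(64, N_ACTIONS),  # outputs action logits
)

# 2. Value network ("critic"): state -> scalar estimate of the value function V(s)
value_net = nn.Sequential(
    nn.Linear(STATE_DIM, 64), nn.Tanh(),
    nn.Linear(64, 1),
)

state = torch.randn(STATE_DIM)                 # stand-in for an observed state
dist = Categorical(logits=policy_net(state))   # pi_theta(. | s)
action = dist.sample()                         # act by sampling from the policy
value = value_net(state)                       # critic's estimate of expected return
```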

Then why do you need a value network? Doesn’t that make it not a policy gradient method?

Policy gradient methods directly optimize the policy (action strategy). Adding a value network, which estimates expected returns, isn’t required but improves learning in two main ways:

  1. Variance Reduction: It helps to decrease the variability in policy updates, making learning more stable.
  2. Better Decision Making: It allows more informed decisions, balancing exploration and exploitation better.

Incorporating a value network doesn’t stop a method from being a policy gradient method. It’s still optimizing the policy directly, just with added benefits.
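
As a rough sketch of how the value network feeds into a policy update (point 1 above): the critic’s estimate is subtracted from the sampled return to form an advantage, and actions that did better than expected are made more likely. The numbers and variable names here are made up for illustration:

```python
import torch

# Hypothetical batch of three timesteps collected by running the current policy
log_probs = torch.tensor([-0.3, -1.2, -0.7], requires_grad=True)  # log pi_theta(a_t | s_t)
returns   = torch.tensor([5.0, 2.0, 3.5])                         # sampled returns G_t
values    = torch.tensor([4.0, 2.5, 3.0])                         # critic's estimates V(s_t)

advantages = returns - values                    # how much better than expected each action did
policy_loss = -(log_probs * advantages).mean()   # minimizing this ascends the expected return
value_loss = (returns - values).pow(2).mean()    # regression loss for training the critic
```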

Policy gradient methods work by directly computing an estimate of the gradient of the expected return with respect to the policy parameters, then improving the policy by stochastic gradient ascent on that return (equivalently, gradient descent on the negative objective).

As talked about above, the value network helps with reducing variance. Pieter Abbeel calls it the baseline, but it’s essentially the value function (your current estimate of how much reward you expect to collect from this state onward).
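
Why subtracting a baseline is “free”: for any state-dependent baseline $b(s)$, such as $V^{\pi_\theta}(s)$, the extra term has zero expectation, so it changes the variance of the gradient estimate but not its mean. This is a standard identity (written here for a discrete action space), not something specific to the lecture slides:

$$
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\big]
= b(s) \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)
= b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s)
= b(s)\, \nabla_\theta 1
= 0
$$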

From the Lecture 3 slides:

  • So you have two neural networks, both of which take in the state $s$. One parameterizes the policy $\pi_\theta(a \mid s)$, and the baseline network generates the value function $V(s)$

These are great to work with because we don’t need an explicit model of the world. See Lilian Weng’s post https://lilianweng.github.io/posts/2018-04-08-policy-gradient/, which talks about how they’re also well-suited to continuous action spaces.

Instead of doing Value Function Approximation and then deriving a policy from the value function, we directly parameterize the policy.
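
One common way to directly parameterize a policy over a continuous action space is a Gaussian whose mean comes from a network, with a learned log-standard-deviation. A minimal sketch, assuming PyTorch and illustrative sizes (3-dimensional state, 1-dimensional action):

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

STATE_DIM, ACTION_DIM = 3, 1  # illustrative sizes

mean_net = nn.Linear(STATE_DIM, ACTION_DIM)       # state -> mean of the action distribution
log_std = nn.Parameter(torch.zeros(ACTION_DIM))   # learned, state-independent log std

state = torch.randn(STATE_DIM)
dist = Normal(mean_net(state), log_std.exp())     # pi_theta(a | s)
action = dist.sample()
log_prob = dist.log_prob(action).sum()            # plugs into the policy gradient estimate
```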

Advantages:

  • Better convergence properties
  • Effective in high-dimensional or continuous action spaces
  • Can learn stochastic policies

Disadvantages:

  • Typically converge to a local rather than global optimum
  • Evaluating a policy is typically inefficient and high variance

Sometimes, stochastic policies are the best.

For example, in rock-paper-scissors, if your policy was deterministic, your opponent would eventually figure it out, and you would keep losing.

Score Function

The score function is the gradient of the log-policy, $\nabla_\theta \log \pi_\theta(s, a)$: how the log-probability of taking action $a$ in state $s$ changes as you move the policy parameters.
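
The likelihood-ratio trick is what makes the score function useful: the gradient of the policy can be written as the policy times its score, which turns the policy gradient into an expectation under the policy that we can estimate by sampling:

$$
\nabla_\theta \pi_\theta(s, a)
= \pi_\theta(s, a)\, \frac{\nabla_\theta \pi_\theta(s, a)}{\pi_\theta(s, a)}
= \pi_\theta(s, a)\, \nabla_\theta \log \pi_\theta(s, a)
$$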

Policy Gradient Theorem

For any differentiable policy $\pi_\theta(s, a)$, for any of the policy objective functions $J = J_1$, $J_{avR}$, or $\frac{1}{1-\gamma} J_{avV}$, the policy gradient is

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(s, a)\, Q^{\pi_\theta}(s, a)\big]
$$

Monte-Carlo Policy Gradient

  • Update parameters by stochastic gradient ascent, using the return $v_t$ as an unbiased sample of $Q^{\pi_\theta}(s_t, a_t)$: $\Delta\theta_t = \alpha\, \nabla_\theta \log \pi_\theta(s_t, a_t)\, v_t$ (this is REINFORCE; a sketch follows below)
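
A minimal REINFORCE sketch, assuming Gymnasium’s CartPole-v1 and PyTorch; the architecture, learning rate, and episode count are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical
import gymnasium as gym

env = gym.make("CartPole-v1")
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))  # 4 obs dims, 2 actions
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99

for episode in range(200):
    obs, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        dist = Categorical(logits=policy(torch.as_tensor(obs, dtype=torch.float32)))
        action = dist.sample()
        obs, reward, terminated, truncated, _ = env.step(action.item())
        done = terminated or truncated
        log_probs.append(dist.log_prob(action))
        rewards.append(float(reward))

    # Monte-Carlo return for each step: v_t = r_{t+1} + gamma * v_{t+1}
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    returns = torch.tensor(returns, dtype=torch.float32)

    # Stochastic gradient ascent on E[v_t * log pi_theta(a_t | s_t)],
    # implemented as descent on the negative
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```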

Critic

Monte-Carlo policy gradient still has high variance, so we use a critic to estimate the action-value function, $Q_w(s, a) \approx Q^{\pi_\theta}(s, a)$. Actor-critic algorithms maintain two sets of parameters (a one-step update is sketched in code below):

  • Critic: updates the action-value function parameters $w$
  • Actor: updates the policy parameters $\theta$, in the direction suggested by the critic
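
A sketch of one actor-critic update on a single transition $(s, a, r, s', a')$, assuming PyTorch, a discrete action space, and a TD(0) target for the critic; the sizes and learning rates are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

STATE_DIM, N_ACTIONS, GAMMA = 4, 2, 0.99  # illustrative sizes

actor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(), nn.Linear(64, N_ACTIONS))
critic = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(), nn.Linear(64, N_ACTIONS))  # Q_w(s, .)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def update(s, a, r, s_next, a_next):
    """One update from a transition: s, s_next are (STATE_DIM,) float tensors; a, a_next ints; r a float."""
    q = critic(s)[a]                                    # Q_w(s, a)
    with torch.no_grad():
        td_target = r + GAMMA * critic(s_next)[a_next]  # TD(0) target

    # Critic: move Q_w(s, a) toward the TD target
    critic_loss = (td_target - q).pow(2)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: update theta in the direction suggested by the critic's estimate
    log_prob = Categorical(logits=actor(s)).log_prob(torch.tensor(a))
    actor_loss = -log_prob * q.detach()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

# Example call with dummy data
update(torch.randn(STATE_DIM), a=0, r=1.0, s_next=torch.randn(STATE_DIM), a_next=1)
```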