Policy Gradient Methods

Vanilla Policy Gradient (VPG)

Resources:

Why is there a value function if we already know how to optimize the policy directly?

Adding a value network, which estimates expected returns, isn’t required, but it improves learning in two main ways:

  1. Variance Reduction: It helps to decrease the variability in policy updates, making learning more stable.
  2. Better Decision Making: It allows more informed policy updates, since each action is weighted by how much better it turned out than the value estimate predicted.

Refer to Generalized Advantage Estimation (GAE). If you just used the return of the trajectory, that would be a valid formulation too, but empirically it seems to learn more slowly because of the higher variance.
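A rough sketch of the two choices (none of this is the Spinning Up code; the function names and defaults are mine): GAE accumulates discounted TD residuals, while the naive option just uses the discounted reward-to-go, which has no baseline and higher variance.

import numpy as np

def discounted_cumsum(x, discount):
    # Discounted cumulative sum, computed right-to-left over one trajectory.
    out = np.zeros_like(x, dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(x))):
        running = x[t] + discount * running
        out[t] = running
    return out

def policy_gradient_weights(rews, vals, gamma=0.99, lam=0.97, use_gae=True):
    # rews: rewards r_0..r_{T-1}; vals: value estimates V(s_0)..V(s_T) (last entry bootstraps the tail).
    rews, vals = np.asarray(rews, np.float64), np.asarray(vals, np.float64)
    if use_gae:
        deltas = rews + gamma * vals[1:] - vals[:-1]   # one-step TD residuals
        return discounted_cumsum(deltas, gamma * lam)  # GAE(gamma, lambda) advantages
    # "Just use the return": discounted reward-to-go, no baseline, higher variance.
    return discounted_cumsum(rews, gamma)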

Spent a little time trying to understand the implementation. The buffer seems crucial to get right:

  • At the end of every iteration, the on-policy buffer is reset
  • The other on-policy policy gradient methods follow the same pattern (sketched below)
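A hypothetical sketch of that data flow (none of these names come from the actual code), just to show where the fresh buffer appears:

def train_on_policy(epochs, steps_per_epoch, collect_step, update):
    # collect_step() returns one transition; update(batch) does one VPG-style update.
    for epoch in range(epochs):
        batch = []                       # fresh buffer at the start of every iteration
        for t in range(steps_per_epoch):
            batch.append(collect_step())
        update(batch)                    # the batch is used for exactly one update...
        # ...and then discarded: on-policy data is never reused across iterations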

VPG is just behavior cloning multiplied by the advantage!

I learned VPG after learning about BC, and I noticed the equations are awfully similar (a loss sketch follows the list below).

  • VPG is on-policy weighted BC
  • AWR is off-policy weighted BC
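Concretely, here is a minimal side-by-side sketch of the two losses (the function names are mine; logp and adv are per-sample tensors of log-probabilities and advantage estimates):

import torch

def bc_loss(logp):
    # Behavior cloning: maximize the log-likelihood of the demonstrated actions.
    return -logp.mean()

def vpg_loss(logp, adv):
    # VPG: the same log-likelihood term, weighted per sample by the advantage
    # estimated from on-policy rollouts.
    return -(logp * adv).mean()

Setting adv to all ones turns the VPG loss back into the BC loss.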

More Implementation Details

Looking at the implementation (OpenAI Spinning Up) and taking some notes.

The Adam optimizer is used for both the policy and the value function, with separate learning rates:

# Set up optimizers for policy and value function
pi_optimizer = Adam(ac.pi.parameters(), lr=pi_lr)
vf_optimizer = Adam(ac.v.parameters(), lr=vf_lr)
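Roughly how these two optimizers get used in one update, written as a hedged sketch: compute_loss_pi and compute_loss_v stand in for the actual loss functions, and train_v_iters for the number of value-fitting steps per policy step.

def update(data, compute_loss_pi, compute_loss_v, pi_optimizer, vf_optimizer, train_v_iters=80):
    # One policy gradient step per batch of on-policy data.
    pi_optimizer.zero_grad()
    loss_pi = compute_loss_pi(data)
    loss_pi.backward()
    pi_optimizer.step()

    # Several regression steps to fit the value function to the empirical returns.
    for _ in range(train_v_iters):
        vf_optimizer.zero_grad()
        loss_v = compute_loss_v(data)
        loss_v.backward()
        vf_optimizer.step()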

Something interesting here: they use an MLP to learn the mean of the Gaussian policy, but for the standard deviation they use a single log-std parameter that is state-independent. When you get to SAC, you will see that this is not the case (there, the network outputs a state-dependent log-std).

import numpy as np
import torch

# Actor and mlp are helpers defined in the same module.
class MLPGaussianActor(Actor):
    def __init__(self, obs_dim, act_dim, hidden_sizes, activation):
        super().__init__()
        # State-independent log-std: a single learnable vector, initialized to -0.5.
        log_std = -0.5 * np.ones(act_dim, dtype=np.float32)
        self.log_std = torch.nn.Parameter(torch.as_tensor(log_std))
        # The MLP only outputs the mean of the Gaussian policy.
        self.mu_net = mlp([obs_dim] + list(hidden_sizes) + [act_dim], activation)
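Something like the following happens when the actor builds its action distribution (a standalone sketch, not the actual method on the class):

import torch
from torch.distributions.normal import Normal

def gaussian_policy_distribution(actor, obs):
    mu = actor.mu_net(obs)            # state-dependent mean from the MLP
    std = torch.exp(actor.log_std)    # one shared std vector, independent of obs
    return Normal(mu, std)            # actions can then be sampled or scored via log_prob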