Policy Gradient Methods

Vanilla Policy Gradient (VPG)

Resources:

Why is there a value function if we know how to optimize the policy directly?

Adding a value network, which estimates expected returns, isn't required, but it improves learning in two main ways (see the sketch after this list):

  1. Variance Reduction: Subtracting the value estimate from the return as a baseline reduces the variance of the policy-gradient updates, making learning more stable.
  2. Better Decision Making: Value estimates give the agent a more informed signal about which states are worth reaching, helping it balance exploration and exploitation.
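As a concrete illustration of the first point, here is a minimal VPG sketch with a value-network baseline: the policy gradient is weighted by the advantage (return minus value estimate) instead of the raw return, which lowers variance without biasing the gradient, since the baseline doesn't depend on the action. This assumes PyTorch and Gymnasium's CartPole-v1; the network sizes, learning rates, and the `run_episode` helper are illustrative choices, not from these notes.

```python
import torch
import torch.nn as nn
import gymnasium as gym

obs_dim, act_dim = 4, 2  # CartPole-v1 observation/action dimensions

policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
value_fn = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
pi_opt = torch.optim.Adam(policy.parameters(), lr=3e-3)
v_opt = torch.optim.Adam(value_fn.parameters(), lr=1e-2)

def run_episode(env, gamma=0.99):
    """Collect one episode; return observations, log-probs, and returns-to-go."""
    obs_list, logp_list, rewards = [], [], []
    obs, _ = env.reset()
    done = False
    while not done:
        obs_t = torch.as_tensor(obs, dtype=torch.float32)
        dist = torch.distributions.Categorical(logits=policy(obs_t))
        action = dist.sample()
        obs_list.append(obs_t)
        logp_list.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(float(reward))
        done = terminated or truncated
    # Discounted returns-to-go, accumulated backwards through the episode.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return torch.stack(obs_list), torch.stack(logp_list), torch.tensor(returns)

env = gym.make("CartPole-v1")
for epoch in range(200):
    obs, logp, rets = run_episode(env)
    # Baseline: subtract V(s) from the return. Detach so the policy loss
    # doesn't backprop into the value network.
    advantages = (rets - value_fn(obs).squeeze(-1)).detach()
    pi_loss = -(logp * advantages).mean()
    pi_opt.zero_grad()
    pi_loss.backward()
    pi_opt.step()
    # Regress the value network toward the observed returns.
    v_loss = ((value_fn(obs).squeeze(-1) - rets) ** 2).mean()
    v_opt.zero_grad()
    v_loss.backward()
    v_opt.step()
```

Swapping the raw return `rets` in for `advantages` gives the same loop without the baseline; it still follows the policy gradient, but the updates are noticeably noisier. That gap is the variance reduction the first point above refers to.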