Vanilla Policy Gradient (VPG)
Resources:
Why is there a value function, if we already know how to optimize the policy directly?
Adding a value network, which estimates expected returns, isn’t required, but it improves learning in two main ways:
- Variance reduction: subtracting the value estimate from the observed return acts as a baseline, decreasing the variance of policy-gradient updates and making learning more stable (see the sketch after this list).
- Better decision making: the value estimate gives the agent a reference for how good each state is, so updates reflect whether an action did better or worse than expected, which helps balance exploration and exploitation.
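Concretely, the baseline leaves the policy gradient unbiased while shrinking its variance: instead of weighting $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ by the raw return $G_t$, VPG weights it by the advantage $G_t - V_\phi(s_t)$. Below is a minimal sketch of a single update, assuming PyTorch; the network sizes, learning rate, and the dummy rollout batch are illustrative placeholders, not part of the original notes.

```python
# Minimal VPG update with a value baseline (illustrative sketch).
import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2  # assumed toy environment dimensions

policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))
value = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(
    list(policy.parameters()) + list(value.parameters()), lr=1e-3
)

# Pretend we collected a batch of (state, action, return-to-go) from rollouts.
states = torch.randn(64, obs_dim)
actions = torch.randint(0, n_actions, (64,))
returns = torch.randn(64)  # discounted returns-to-go G_t (dummy data here)

# Advantage A_t = G_t - V(s_t); detach so the baseline only reduces variance
# and gradients don't flow from the policy loss into the value network.
values = value(states).squeeze(-1)
log_probs = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
advantages = (returns - values).detach()

policy_loss = -(log_probs * advantages).mean()  # REINFORCE-with-baseline objective
value_loss = (values - returns).pow(2).mean()   # regress V(s_t) toward observed returns

optimizer.zero_grad()
(policy_loss + value_loss).backward()
optimizer.step()
```

Note the `.detach()` on the advantages: the value network is trained only through its own regression loss, while the policy loss treats the baseline as a constant, which is what keeps the gradient estimate unbiased.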