Actor-Critic Policy Gradient
The point of actor-critic methods is to decouple the policy-gradient update (the actor) from the Q-function update (the critic), combining the strengths of both families of methods:
- Policy-based methods optimize the policy directly, but their gradient estimates (e.g., Monte Carlo returns in REINFORCE) have high variance
- Value-based methods learn value functions efficiently, but they do not produce an explicit policy
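Concretely, in one common formulation (notation introduced here, not taken from these notes: $\pi_\theta$ is the policy with parameters $\theta$, $Q_w$ is the critic with parameters $w$), the actor follows an approximate policy gradient while the critic is fit with a TD-style update:

$$
\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q_w(s, a)\right]
$$

$$
\delta = r + \gamma\, Q_w(s', a') - Q_w(s, a), \qquad w \leftarrow w + \beta\, \delta\, \nabla_w Q_w(s, a)
$$

The actor's step changes only $\theta$ and the critic's step changes only $w$, which is the decoupling referred to above.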
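Below is a minimal runnable sketch of this loop in Python with PyTorch and Gymnasium, assuming a discrete-action environment (CartPole-v1 here); every name and hyperparameter is illustrative rather than taken from these notes. It uses a state-value critic and treats the TD error as the advantage signal, a common variant; the Q-critic form above is the same pattern with a $Q(s,a)$ head and a SARSA-style target.

```python
# Minimal one-step actor-critic sketch (assumptions: PyTorch and Gymnasium
# installed, a discrete-action env such as CartPole-v1; all names and
# hyperparameters are illustrative, not prescribed by these notes).
import gymnasium as gym
import torch
import torch.nn as nn
from torch.distributions import Categorical

env = gym.make("CartPole-v1")
obs_dim = env.observation_space.shape[0]
n_actions = env.action_space.n

# Actor: parameters theta, maps state -> action logits (the policy).
actor = nn.Sequential(nn.Linear(obs_dim, 128), nn.Tanh(), nn.Linear(128, n_actions))
# Critic: parameters w, maps state -> V(s); its TD error serves as the advantage.
critic = nn.Sequential(nn.Linear(obs_dim, 128), nn.Tanh(), nn.Linear(128, 1))

actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=3e-3)
gamma = 0.99

for episode in range(500):
    obs, _ = env.reset()
    done = False
    while not done:
        obs_t = torch.as_tensor(obs, dtype=torch.float32)
        dist = Categorical(logits=actor(obs_t))
        action = dist.sample()

        next_obs, reward, terminated, truncated, _ = env.step(action.item())
        done = terminated or truncated

        # One-step TD target: bootstrap from V(s') unless the episode terminated.
        value = critic(obs_t).squeeze(-1)
        with torch.no_grad():
            next_value = critic(torch.as_tensor(next_obs, dtype=torch.float32)).squeeze(-1)
            target = reward + gamma * next_value * (0.0 if terminated else 1.0)
        td_error = target - value

        # Critic step: regress V(s) toward the TD target (updates w only).
        critic_opt.zero_grad()
        td_error.pow(2).backward()
        critic_opt.step()

        # Actor step: policy gradient weighted by the detached TD error (updates theta only).
        actor_opt.zero_grad()
        (-dist.log_prob(action) * td_error.detach()).backward()
        actor_opt.step()

        obs = next_obs
```

Keeping separate networks and optimizers makes the decoupling explicit: the critic step touches only $w$, the actor step touches only $\theta$. In practice the two heads often share a torso, in which case their gradients mix in the shared layers.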