Policy Gradient Methods

Deep Deterministic Policy Gradient (DDPG)

More sample efficient than on-policy policy gradient methods, because it can reuse old transitions instead of discarding them after each update.

Why is it called DDPG?

Because the policy that is learned is deterministic: for a given state it outputs a single action, rather than a distribution over actions to sample from.
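
A minimal sketch of what "deterministic" means in code, assuming a PyTorch actor network (layer sizes and names are illustrative): the actor maps a state directly to one action instead of to a distribution.

```python
import torch
import torch.nn as nn

class DeterministicActor(nn.Module):
    """Maps a state to a single action (no sampling), as in DDPG."""
    def __init__(self, obs_dim, act_dim, act_limit):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, act_dim), nn.Tanh(),  # squash to [-1, 1]
        )
        self.act_limit = act_limit

    def forward(self, obs):
        # Deterministic: the same state always yields the same action.
        return self.act_limit * self.net(obs)
```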


Deeply connected to Q-Learning.

Algorithms like DDPG and Q-Learning are off-policy, so they can reuse old data very efficiently. They gain this benefit by exploiting the Bellman optimality equation, which a Q-function can be trained to satisfy using transitions from any environment interaction data.
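
Since any transition is valid training data, the data pipeline can be as simple as a replay buffer; a minimal sketch (names and capacity are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s, a, r, s', done) transitions from any behavior policy."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s2, done):
        self.buffer.append((s, a, r, s2, done))

    def sample(self, batch_size):
        # Old transitions can be reused arbitrarily often -- the Bellman
        # equation holds for any transition, not just on-policy ones.
        return random.sample(self.buffer, batch_size)
```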

DDPG interleaves learning an approximator to $Q^*(s,a)$ with learning an approximator to $a^*(s)$.

Mean Squared Bellman Error:

$$L(\phi, \mathcal{D}) = \mathbb{E}_{(s,a,r,s',d) \sim \mathcal{D}} \left[ \left( Q_\phi(s,a) - \left( r + \gamma (1 - d) \max_{a'} Q_\phi(s',a') \right) \right)^2 \right]$$

where $\mathcal{D}$ is the replay buffer of transitions and $d$ indicates whether $s'$ is terminal.
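
As a concrete reading of the formula, here is a hedged sketch of the target and squared error when the action set is small and discrete, so the max is just a reduction over an action dimension (tensor shapes and names are illustrative):

```python
import torch

def msbe_with_max(q, q_of_next, r, done, gamma=0.99):
    """Mean squared Bellman error when the max over a' is tractable (discrete actions).

    q:          Q_phi(s, a) for the actions actually taken, shape (batch,)
    q_of_next:  Q_phi(s', a') for every discrete action a', shape (batch, n_actions)
    """
    with torch.no_grad():
        # Bellman target: r + gamma * (1 - d) * max_a' Q(s', a')
        target = r + gamma * (1.0 - done) * q_of_next.max(dim=1).values
    return ((q - target) ** 2).mean()
```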

In practice, however, we deal with continuous action spaces, where taking the max over actions is not straightforward. So we learn a policy network $\mu_\theta(s)$ to approximate the action that maximizes $Q_\phi(s, a)$.

That's done with gradient ascent on

$$\max_\theta \; \mathbb{E}_{s \sim \mathcal{D}} \left[ Q_\phi(s, \mu_\theta(s)) \right].$$

Combined, we have: the Q-function minimizes the MSBE with the max replaced by the learned policy,

$$L(\phi, \mathcal{D}) = \mathbb{E}_{(s,a,r,s',d) \sim \mathcal{D}} \left[ \left( Q_\phi(s,a) - \left( r + \gamma (1 - d)\, Q_\phi\!\left(s', \mu_\theta(s')\right) \right) \right)^2 \right],$$

while the policy is updated by gradient ascent on the objective above. (In the full algorithm, slowly updated target copies of $Q_\phi$ and $\mu_\theta$ are used to compute the target, which stabilizes training.)
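
A hedged sketch of one combined DDPG update step in PyTorch, assuming an actor, a critic that takes (state, action), slowly updated target copies of each, optimizers, and a batch sampled from the replay buffer (all names are illustrative):

```python
import torch

def ddpg_update(actor, critic, actor_targ, critic_targ,
                actor_opt, critic_opt, batch, gamma=0.99):
    s, a, r, s2, done = batch  # tensors sampled from the replay buffer

    # Critic: minimize the MSBE, with the max over a' replaced by the target actor.
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * critic_targ(s2, actor_targ(s2))
    critic_loss = ((critic(s, a) - target) ** 2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: gradient ascent on E[Q(s, mu(s))], expressed as minimizing the negative.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```

Negating the critic's mean value is just the usual way to do gradient ascent with an optimizer that minimizes.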

  • Not to be confused with DDPM (Denoising Diffusion Probabilistic Models)