Deep Deterministic Policy Gradient (DDPG)
More sample efficient than on-policy policy-gradient methods, because it can reuse off-policy data.
Why is it called DDPG?
Because the policy that is learned is deterministic.
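As a minimal sketch (assuming a PyTorch setup with hypothetical names `obs_dim`, `act_dim`, and `act_limit`), a deterministic policy is just a network that maps a state to a single action, rather than to a distribution you sample from:

```python
import torch
import torch.nn as nn

class DeterministicActor(nn.Module):
    """mu_theta(s): maps a state directly to one action (no sampling)."""
    def __init__(self, obs_dim: int, act_dim: int, act_limit: float):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, act_dim), nn.Tanh(),  # squash to [-1, 1]
        )
        self.act_limit = act_limit

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Scale the tanh output to the environment's action range.
        return self.act_limit * self.net(obs)
```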
Resources
- https://spinningup.openai.com/en/latest/algorithms/ddpg.html
- Lecture 5: DDPG and SAC from Deep RL Foundations, slides here
Deeply connected to Q-Learning.
Algorithms like DDPG and Q-Learning are off-policy, so they are able to reuse old data very efficiently. They gain this benefit by exploiting Bellman’s equations for optimality, which a Q-function can be trained to satisfy using any environment interaction data.
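This data reuse is usually implemented with a replay buffer. A minimal sketch (class and field names are illustrative, not from any particular library): old transitions can be replayed because the Bellman target only needs $(s, a, r, s', d)$, not the policy that produced the action.

```python
import numpy as np

class ReplayBuffer:
    """Stores transitions from any behaviour policy for off-policy reuse."""
    def __init__(self, obs_dim: int, act_dim: int, size: int):
        self.obs = np.zeros((size, obs_dim), dtype=np.float32)
        self.act = np.zeros((size, act_dim), dtype=np.float32)
        self.rew = np.zeros(size, dtype=np.float32)
        self.next_obs = np.zeros((size, obs_dim), dtype=np.float32)
        self.done = np.zeros(size, dtype=np.float32)
        self.ptr, self.full, self.size = 0, False, size

    def store(self, o, a, r, o2, d):
        self.obs[self.ptr], self.act[self.ptr] = o, a
        self.rew[self.ptr], self.next_obs[self.ptr], self.done[self.ptr] = r, o2, d
        self.ptr = (self.ptr + 1) % self.size
        self.full = self.full or self.ptr == 0

    def sample(self, batch_size: int):
        n = self.size if self.full else self.ptr
        idx = np.random.randint(0, n, size=batch_size)
        return dict(obs=self.obs[idx], act=self.act[idx], rew=self.rew[idx],
                    next_obs=self.next_obs[idx], done=self.done[idx])
```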
DDPG interleaves learning an approximator for $Q^*(s, a)$ with learning an approximator to $a^*(s)$.
- The Q-learning target is $r + \gamma \max_{a'} Q_\phi(s', a')$: notice that we take the max over actions, so it looks like a Bellman optimality backup.
In practice, however, we deal with continuous action spaces, where taking the max over actions is not so obvious. So we learn a policy network $\mu_\theta(s)$ to predict how to maximize $Q$.
That’s done with gradient ascent on $\mathbb{E}_{s \sim \mathcal{D}}\big[Q_\phi(s, \mu_\theta(s))\big]$ with respect to the policy parameters $\theta$. Combined, we have $\max_a Q_\phi(s, a) \approx Q_\phi(s, \mu_\theta(s))$.
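Putting the two pieces together, here is a minimal sketch of one DDPG update in PyTorch (the `actor`, `critic`, target networks, optimizers, and the sampled `batch` of tensors are all assumed to exist; target-network updates and exploration noise are omitted):

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99):
    """One DDPG gradient step on a sampled batch (sketch)."""
    obs, act, rew, next_obs, done = (batch[k] for k in
                                     ("obs", "act", "rew", "next_obs", "done"))

    # Critic: regress Q_phi(s, a) onto the target r + gamma * Q_targ(s', mu_targ(s')),
    # i.e. the max over actions is approximated by the target actor's action.
    with torch.no_grad():
        next_act = target_actor(next_obs)
        target = rew + gamma * (1 - done) * target_critic(next_obs, next_act)
    critic_loss = F.mse_loss(critic(obs, act), target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: gradient ascent on Q_phi(s, mu_theta(s)), done by minimizing the negative.
    actor_loss = -critic(obs, actor(obs)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```

The key design choice this illustrates: the max in the Bellman backup is replaced by evaluating the critic at the actor's output, which is exactly why DDPG learns $\mu_\theta$ alongside $Q_\phi$.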
Related
- Not to be confused with DDPM