Actor-Critic Methods

Deep Deterministic Policy Gradient (DDPG)

More sample efficient than on-policy methods, though it has largely been superseded by TD3, which fixes several of DDPG's weaknesses (most notably Q-value overestimation). DDPG can be thought of as DQN for continuous action spaces.

Why is it called DDPG?

Because the policy that is learned is deterministic: the actor maps each state to a single action rather than to a distribution over actions.

Resources

Algorithms like DDPG and Q-Learning are off-policy, so they can reuse old data very efficiently. They gain this benefit by exploiting the Bellman optimality equation, which a Q-function can be trained to satisfy using any environment interaction data, typically stored in a replay buffer.
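
A minimal replay buffer sketch to make the data reuse concrete. This is an assumption about the implementation, not code from these notes; the class name, field layout, and NumPy-based storage are all illustrative.

```python
# Minimal replay buffer sketch (illustrative; NumPy-based ring buffer).
import numpy as np

class ReplayBuffer:
    def __init__(self, obs_dim, act_dim, size):
        self.obs = np.zeros((size, obs_dim), dtype=np.float32)
        self.next_obs = np.zeros((size, obs_dim), dtype=np.float32)
        self.acts = np.zeros((size, act_dim), dtype=np.float32)
        self.rews = np.zeros(size, dtype=np.float32)
        self.done = np.zeros(size, dtype=np.float32)
        self.ptr, self.size, self.max_size = 0, 0, size

    def store(self, obs, act, rew, next_obs, done):
        # Overwrite the oldest transition once the buffer is full.
        self.obs[self.ptr] = obs
        self.acts[self.ptr] = act
        self.rews[self.ptr] = rew
        self.next_obs[self.ptr] = next_obs
        self.done[self.ptr] = done
        self.ptr = (self.ptr + 1) % self.max_size
        self.size = min(self.size + 1, self.max_size)

    def sample(self, batch_size=128):
        # Uniform sampling: off-policy methods can learn from any stored data.
        idx = np.random.randint(0, self.size, size=batch_size)
        return dict(obs=self.obs[idx], acts=self.acts[idx], rews=self.rews[idx],
                    next_obs=self.next_obs[idx], done=self.done[idx])
```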

DDPG interleaves learning an approximator for $Q^*(s,a)$ with learning an approximator for the optimal action $a^*(s)$.
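
A sketch of what those two approximators might look like as PyTorch modules. The names (`Actor`, `Critic`), layer sizes, and activations are illustrative assumptions, not taken from these notes.

```python
# Sketch of the two approximators as PyTorch modules (sizes/activations illustrative).
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Approximates Q(s, a): state-action pair in, scalar value out."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

class Actor(nn.Module):
    """Approximates the optimal action: deterministic map from state to action."""
    def __init__(self, obs_dim, act_dim, act_limit=1.0, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),  # squash to [-1, 1]
        )
        self.act_limit = act_limit

    def forward(self, obs):
        return self.act_limit * self.net(obs)
```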

The critic $Q_\phi$ is trained to minimize the Mean Squared Bellman Error over transitions $(s, a, r, s', d)$ sampled from the replay buffer $\mathcal{D}$:

$$L(\phi, \mathcal{D}) = \mathbb{E}_{(s,a,r,s',d) \sim \mathcal{D}} \Big[ \big( Q_\phi(s,a) - (r + \gamma (1 - d) \max_{a'} Q_\phi(s', a')) \big)^2 \Big]$$

In practice, however, we deal with continuous action spaces, where taking the max over actions is not tractable. So we learn a policy network $\mu_\theta(s)$ whose output approximates the maximizing action:

$$\max_{a} Q_\phi(s, a) \approx Q_\phi(s, \mu_\theta(s))$$
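
A sketch of the resulting critic update, assuming the PyTorch modules above plus slowly-updated target copies (`actor_targ`, `critic_targ`), which standard DDPG maintains to stabilize the targets; the function name, batch layout, and `gamma` default are illustrative.

```python
# Critic loss sketch: MSBE with the target actor standing in for the argmax.
# Assumes the modules above plus target copies (actor_targ, critic_targ), which
# standard DDPG maintains via Polyak averaging; batch layout matches the buffer.
import torch

def critic_loss(critic, critic_targ, actor_targ, batch, gamma=0.99):
    obs = torch.as_tensor(batch['obs'])
    acts = torch.as_tensor(batch['acts'])
    rews = torch.as_tensor(batch['rews'])
    next_obs = torch.as_tensor(batch['next_obs'])
    done = torch.as_tensor(batch['done'])

    q = critic(obs, acts)
    with torch.no_grad():
        # mu_targ(s') approximates argmax_a' Q(s', a') for continuous actions.
        next_acts = actor_targ(next_obs)
        target = rews + gamma * (1.0 - done) * critic_targ(next_obs, next_acts)
    return ((q - target) ** 2).mean()
```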

For updating the policy $\mu_\theta$, we just do gradient ascent on $\theta$ to maximize the expected Q-value:

$$\max_\theta \; \mathbb{E}_{s \sim \mathcal{D}} \big[ Q_\phi(s, \mu_\theta(s)) \big]$$
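
In code this is usually implemented as gradient descent on the negated objective. The sketch below assumes the PyTorch modules above; the function name and batch layout are illustrative.

```python
# Actor update sketch: gradient ascent on Q(s, mu(s)), written as descent on its
# negative. Only the actor's optimizer steps on this loss, so the critic's
# weights are left unchanged even though gradients flow through it.
import torch

def actor_loss(actor, critic, batch):
    obs = torch.as_tensor(batch['obs'])
    return -critic(obs, actor(obs)).mean()

# Usage (illustrative):
# actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
# actor_opt.zero_grad()
# actor_loss(actor, critic, batch).backward()
# actor_opt.step()
```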

Notes on Implementation

Note that, unlike stochastic policy-gradient implementations where an MLP predicts both a mean and a variance, the DDPG actor is an MLP that outputs a single deterministic action; exploration comes from adding noise to that action when collecting data (see the sketch below).
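
A sketch of action selection with Gaussian exploration noise, which is the common choice in modern DDPG implementations (the original paper used Ornstein-Uhlenbeck noise); the noise scale and clipping bounds are illustrative.

```python
# Action-selection sketch: deterministic actor output plus Gaussian exploration
# noise. Noise scale and clipping bounds are illustrative.
import numpy as np
import torch

def get_action(actor, obs, noise_scale=0.1, act_limit=1.0):
    with torch.no_grad():
        act = actor(torch.as_tensor(obs, dtype=torch.float32)).numpy()
    act += noise_scale * np.random.randn(*act.shape)
    return np.clip(act, -act_limit, act_limit)
```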

  • Not to be confused with DDPM (Denoising Diffusion Probabilistic Models)