Actor-Critic Methods

Deep Deterministic Policy Gradient (DDPG)

More sample efficient than on-policy methods, though it has largely been superseded by TD3, which fixes several of DDPG's weaknesses (most notably Q-value overestimation). DDPG can be thought of as DQN for continuous action spaces.

Why is it called DDPG?

Because the policy that is learned is deterministic: the actor maps each state to a single action rather than to a distribution over actions.

Resources

Algorithms like DDPG and Q-Learning are off-policy, so they can reuse old data very efficiently. They gain this benefit by exploiting the Bellman optimality equation, which a Q-function can be trained to satisfy using any environment interaction data, typically stored in a replay buffer.
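
A minimal replay buffer sketch to make the data reuse concrete. This is an assumption about the implementation, not code from these notes; the class name, field layout, and NumPy-based storage are all illustrative.

```python
# Minimal replay buffer sketch (illustrative; NumPy-based ring buffer).
import numpy as np

class ReplayBuffer:
    def __init__(self, obs_dim, act_dim, size):
        self.obs = np.zeros((size, obs_dim), dtype=np.float32)
        self.next_obs = np.zeros((size, obs_dim), dtype=np.float32)
        self.acts = np.zeros((size, act_dim), dtype=np.float32)
        self.rews = np.zeros(size, dtype=np.float32)
        self.done = np.zeros(size, dtype=np.float32)
        self.ptr, self.size, self.max_size = 0, 0, size

    def store(self, obs, act, rew, next_obs, done):
        # Overwrite the oldest transition once the buffer is full.
        self.obs[self.ptr] = obs
        self.acts[self.ptr] = act
        self.rews[self.ptr] = rew
        self.next_obs[self.ptr] = next_obs
        self.done[self.ptr] = done
        self.ptr = (self.ptr + 1) % self.max_size
        self.size = min(self.size + 1, self.max_size)

    def sample(self, batch_size=128):
        # Uniform sampling: off-policy methods can learn from any stored data.
        idx = np.random.randint(0, self.size, size=batch_size)
        return dict(obs=self.obs[idx], acts=self.acts[idx], rews=self.rews[idx],
                    next_obs=self.next_obs[idx], done=self.done[idx])
```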

DDPG interleaves learning an approximator for $Q^*(s,a)$ with learning an approximator for the optimal action $a^*(s)$.
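
A sketch of what those two approximators might look like as PyTorch modules. The names (`Actor`, `Critic`), layer sizes, and activations are illustrative assumptions, not taken from these notes.

```python
# Sketch of the two approximators as PyTorch modules (sizes/activations illustrative).
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Approximates Q(s, a): state-action pair in, scalar value out."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

class Actor(nn.Module):
    """Approximates the optimal action: deterministic map from state to action."""
    def __init__(self, obs_dim, act_dim, act_limit=1.0, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),  # squash to [-1, 1]
        )
        self.act_limit = act_limit

    def forward(self, obs):
        return self.act_limit * self.net(obs)
```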

The critic $Q_\phi$ is trained to minimize the Mean Squared Bellman Error over transitions $(s, a, r, s', d)$ sampled from the replay buffer $\mathcal{D}$:

$$L(\phi, \mathcal{D}) = \mathbb{E}_{(s,a,r,s',d) \sim \mathcal{D}} \Big[ \big( Q_\phi(s,a) - (r + \gamma (1 - d) \max_{a'} Q_\phi(s', a')) \big)^2 \Big]$$

In practice, however, we deal with continuous action spaces, where taking the max over actions is not tractable. So we learn a policy network $\mu_\theta(s)$ whose output approximates the maximizing action:

$$\max_{a} Q_\phi(s, a) \approx Q_\phi(s, \mu_\theta(s))$$
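
A sketch of the resulting critic update, assuming the PyTorch modules above plus slowly-updated target copies (`actor_targ`, `critic_targ`), which standard DDPG maintains to stabilize the targets; the function name, batch layout, and `gamma` default are illustrative.

```python
# Critic loss sketch: MSBE with the target actor standing in for the argmax.
# Assumes the modules above plus target copies (actor_targ, critic_targ), which
# standard DDPG maintains via Polyak averaging; batch layout matches the buffer.
import torch

def critic_loss(critic, critic_targ, actor_targ, batch, gamma=0.99):
    obs = torch.as_tensor(batch['obs'])
    acts = torch.as_tensor(batch['acts'])
    rews = torch.as_tensor(batch['rews'])
    next_obs = torch.as_tensor(batch['next_obs'])
    done = torch.as_tensor(batch['done'])

    q = critic(obs, acts)
    with torch.no_grad():
        # mu_targ(s') approximates argmax_a' Q(s', a') for continuous actions.
        next_acts = actor_targ(next_obs)
        target = rews + gamma * (1.0 - done) * critic_targ(next_obs, next_acts)
    return ((q - target) ** 2).mean()
```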

For updating the policy $\mu_\theta$, we just do gradient ascent on $\theta$ to maximize the expected Q-value:

$$\max_\theta \; \mathbb{E}_{s \sim \mathcal{D}} \big[ Q_\phi(s, \mu_\theta(s)) \big]$$
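
In code this is usually implemented as gradient descent on the negated objective. The sketch below assumes the PyTorch modules above; the function name and batch layout are illustrative.

```python
# Actor update sketch: gradient ascent on Q(s, mu(s)), written as descent on its
# negative. Only the actor's optimizer steps on this loss, so the critic's
# weights are left unchanged even though gradients flow through it.
import torch

def actor_loss(actor, critic, batch):
    obs = torch.as_tensor(batch['obs'])
    return -critic(obs, actor(obs)).mean()

# Usage (illustrative):
# actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
# actor_opt.zero_grad()
# actor_loss(actor, critic, batch).backward()
# actor_opt.step()
```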

Notes on Implementation

Note that, unlike stochastic policy-gradient implementations where an MLP predicts both a mean and a variance, the DDPG actor is an MLP that outputs a single deterministic action; exploration comes from adding noise to that action when collecting data (see the sketch below).
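
A sketch of action selection with Gaussian exploration noise, which is the common choice in modern DDPG implementations (the original paper used Ornstein-Uhlenbeck noise); the noise scale and clipping bounds are illustrative.

```python
# Action-selection sketch: deterministic actor output plus Gaussian exploration
# noise. Noise scale and clipping bounds are illustrative.
import numpy as np
import torch

def get_action(actor, obs, noise_scale=0.1, act_limit=1.0):
    with torch.no_grad():
        act = actor(torch.as_tensor(obs, dtype=torch.float32)).numpy()
    act += noise_scale * np.random.randn(*act.shape)
    return np.clip(act, -act_limit, act_limit)
```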

  • Not to be confused with DDPM (Denoising Diffusion Probabilistic Models)