Policy Extraction
Policy extraction is the problem of extracting a policy from a Value Function when you’re doing Value-Based RL. In practice, this is only really a problem in continuous action space settings.
In discrete action space settings, the policy extraction problem is trivial: just take π(s) = argmax_a Q(s, a). However, when working in a continuous action space, this is an open problem, and the approach depends on how the policy is parametrized.
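As a quick illustration of the discrete case, here’s a minimal sketch (the `QNetwork` module, its sizes, and the batch shapes are made up purely for illustration):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Hypothetical Q-network: maps a state to one Q-value per discrete action."""
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

def extract_policy(q_net: QNetwork, state: torch.Tensor) -> torch.Tensor:
    """Greedy policy extraction: pi(s) = argmax_a Q(s, a)."""
    with torch.no_grad():
        return q_net(state).argmax(dim=-1)

# Usage: batch of 4 random 3-dimensional states, 5 discrete actions.
q_net = QNetwork(state_dim=3, num_actions=5)
actions = extract_policy(q_net, torch.randn(4, 3))  # shape: (4,)
```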
- In DDPG, we simply train μ_θ to maximize Q(s, μ_θ(s)), where μ_θ is some basic MLP that deterministically outputs an action (see the DDPG sketch after this list)
- In SAC, we use a Gaussian Policy and the Reparametrization Trick to maximize Q(s, a) with a = μ_θ(s) + σ_θ(s)·ε, ε ~ N(0, I), so that the gradients can flow through the sampled action to the parameters of the Gaussian policy (see the SAC sketch after this list). Okay, there’s also Entropy Regularization in the maximization objective, but that’s out of scope here
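A minimal sketch of the DDPG-style actor update, assuming a critic that takes the concatenated (state, action) as input; all module names and sizes here are illustrative, not from any particular codebase:

```python
import torch
import torch.nn as nn

state_dim, action_dim = 8, 2

# Deterministic actor mu_theta(s): a plain MLP with tanh-squashed output.
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())
# Critic Q_phi(s, a): takes the concatenated (state, action) pair.
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)

states = torch.randn(32, state_dim)  # stand-in for a replay-buffer batch

# Actor update: maximize Q(s, mu_theta(s)), i.e. minimize its negative.
# Gradients flow through the critic into the actor, but only the actor is updated.
actor_loss = -critic(torch.cat([states, actor(states)], dim=-1)).mean()
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()
```

And a similar sketch of the SAC-style update with the Reparametrization Trick (the entropy term is included since it’s part of the actual objective; the tanh log-prob correction is omitted to keep it short):

```python
import torch
import torch.nn as nn

state_dim, action_dim = 8, 2

# Gaussian policy head: outputs mean and log-std of a diagonal Gaussian.
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                       nn.Linear(64, 2 * action_dim))
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
policy_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
alpha = 0.2  # entropy temperature

states = torch.randn(32, state_dim)

mean, log_std = policy(states).chunk(2, dim=-1)
std = log_std.clamp(-5, 2).exp()

# Reparametrization trick: a = mean + std * eps, eps ~ N(0, I),
# so gradients flow through the sampled action back into mean/std (the policy params).
eps = torch.randn_like(mean)
pre_tanh = mean + std * eps
actions = torch.tanh(pre_tanh)

log_prob = torch.distributions.Normal(mean, std).log_prob(pre_tanh).sum(-1)
# (Real SAC also applies a tanh correction to log_prob; omitted here for brevity.)

q_values = critic(torch.cat([states, actions], dim=-1)).squeeze(-1)
policy_loss = (alpha * log_prob - q_values).mean()  # maximize Q plus entropy

policy_opt.zero_grad()
policy_loss.backward()
policy_opt.step()
```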
In Offline RL, this is even more of a problem, because you want the extracted policy to stay close to the behavior policy, or else you end up exploiting errors in Q on out-of-distribution actions.
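One common explicit way to keep the extracted policy close to the behavior policy is a behavior-cloning penalty on the actor loss, TD3+BC-style. The sketch below shows that generic idea; the weight `bc_alpha` and all shapes are illustrative assumptions, not a specific method from this note:

```python
import torch
import torch.nn as nn

state_dim, action_dim = 8, 2

actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)

# Stand-in for an offline batch: states and the dataset (behavior) actions.
states = torch.randn(32, state_dim)
dataset_actions = torch.rand(32, action_dim) * 2 - 1

bc_alpha = 2.5  # illustrative trade-off weight between Q-maximization and BC

pred_actions = actor(states)
q_values = critic(torch.cat([states, pred_actions], dim=-1))

# Maximize Q, but penalize deviation from the dataset actions so the policy
# stays close to the behavior distribution instead of exploiting errors in Q.
actor_loss = -q_values.mean() + bc_alpha * ((pred_actions - dataset_actions) ** 2).mean()

actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()
```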
Explicit vs. implicit policy extraction
First ran into this term while going through the Batch Online RL paper.
Explicit policy extraction: train the policy directly against the Q-function, e.g. a Q-maximization loss plus a behavior-regularization term (like the sketch above). This approach has the advantage of explicitly learning from signals from the Q-function, while still making the policy stay close to the behavior dataset.
Implicit policy extraction:
- You simply do multi-trajectory sampling: the Q-function isn’t used to train the policy directly; instead you sample multiple candidates and use it only to choose among them (see the sketch after the quote below)
“While implicit policy extraction loses potentially useful signals from the Q-function for the policy, it has the advantage of disentangling the value function and policy training, which provides more stable learning”.
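A sketch of what the sampling-based implicit route can look like: the policy is trained purely by behavior cloning (no Q gradients), and the Q-function only ranks sampled candidates at inference time. Sampling single-step actions here (rather than full trajectories), plus all the shapes and the noise scale, are illustrative assumptions:

```python
import torch
import torch.nn as nn

state_dim, action_dim, num_samples = 8, 2, 16

# Behavior-cloned policy (trained separately with plain BC; no Q gradients).
bc_policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                          nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))

def act(state: torch.Tensor) -> torch.Tensor:
    """Sample candidates from the BC policy; use Q only to pick the best one."""
    with torch.no_grad():
        mean = bc_policy(state)                                          # (action_dim,)
        candidates = mean + 0.1 * torch.randn(num_samples, action_dim)   # N noisy samples
        q = critic(torch.cat([state.expand(num_samples, -1), candidates], dim=-1))
        return candidates[q.squeeze(-1).argmax()]

action = act(torch.randn(state_dim))
```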