Policy Extraction
Policy extraction is the problem of extracting a policy from a Value Function when you’re doing Value-Based RL. In practice, this is only really a problem in continuous action space settings.
In discrete action space settings, the policy extraction problem is trivial: just take π(s) = argmax_a Q(s, a). However, when working in a continuous action space, this is an open problem, and the approach depends on how the policy is parametrized.
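As a quick illustration of the discrete case, here’s a minimal sketch (the `QNetwork` module, its sizes, and the batch shapes are made up purely for illustration):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Hypothetical Q-network: maps a state to one Q-value per discrete action."""
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

def extract_policy(q_net: QNetwork, state: torch.Tensor) -> torch.Tensor:
    """Greedy policy extraction: pi(s) = argmax_a Q(s, a)."""
    with torch.no_grad():
        return q_net(state).argmax(dim=-1)

# Usage: batch of 4 random 3-dimensional states, 5 discrete actions.
q_net = QNetwork(state_dim=3, num_actions=5)
actions = extract_policy(q_net, torch.randn(4, 3))  # shape: (4,)
```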
- In DDPG, we simply train μ_θ to maximize Q(s, μ_θ(s)), where μ_θ is some basic MLP that deterministically outputs an action (see the DDPG sketch after this list)
- In SAC, we use a Gaussian Policy and the Reparametrization Trick to maximize Q(s, a) with a = μ_θ(s) + σ_θ(s)·ε, ε ~ N(0, I), so that the gradients can flow through the sampled action to the parameters of the Gaussian policy (see the SAC sketch after this list). Okay, there’s also Entropy Regularization in the maximization objective, but that’s out of scope here
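A minimal sketch of the DDPG-style actor update, assuming a critic that takes the concatenated (state, action) as input; all module names and sizes here are illustrative, not from any particular codebase:

```python
import torch
import torch.nn as nn

state_dim, action_dim = 8, 2

# Deterministic actor mu_theta(s): a plain MLP with tanh-squashed output.
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())
# Critic Q_phi(s, a): takes the concatenated (state, action) pair.
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)

states = torch.randn(32, state_dim)  # stand-in for a replay-buffer batch

# Actor update: maximize Q(s, mu_theta(s)), i.e. minimize its negative.
# Gradients flow through the critic into the actor, but only the actor is updated.
actor_loss = -critic(torch.cat([states, actor(states)], dim=-1)).mean()
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()
```

And a similar sketch of the SAC-style update with the Reparametrization Trick (the entropy term is included since it’s part of the actual objective; the tanh log-prob correction is omitted to keep it short):

```python
import torch
import torch.nn as nn

state_dim, action_dim = 8, 2

# Gaussian policy head: outputs mean and log-std of a diagonal Gaussian.
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                       nn.Linear(64, 2 * action_dim))
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
policy_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
alpha = 0.2  # entropy temperature

states = torch.randn(32, state_dim)

mean, log_std = policy(states).chunk(2, dim=-1)
std = log_std.clamp(-5, 2).exp()

# Reparametrization trick: a = mean + std * eps, eps ~ N(0, I),
# so gradients flow through the sampled action back into mean/std (the policy params).
eps = torch.randn_like(mean)
pre_tanh = mean + std * eps
actions = torch.tanh(pre_tanh)

log_prob = torch.distributions.Normal(mean, std).log_prob(pre_tanh).sum(-1)
# (Real SAC also applies a tanh correction to log_prob; omitted here for brevity.)

q_values = critic(torch.cat([states, actions], dim=-1)).squeeze(-1)
policy_loss = (alpha * log_prob - q_values).mean()  # maximize Q plus entropy

policy_opt.zero_grad()
policy_loss.backward()
policy_opt.step()
```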
In Offline RL, this is even more of a problem, because you want the extracted policy to stay close to the behavior policy, or else you end up exploiting errors in Q on out-of-distribution actions.
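One common explicit way to keep the extracted policy close to the behavior policy is a behavior-cloning penalty on the actor loss, TD3+BC-style. The sketch below shows that generic idea; the weight `bc_alpha` and all shapes are illustrative assumptions, not a specific method from this note:

```python
import torch
import torch.nn as nn

state_dim, action_dim = 8, 2

actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)

# Stand-in for an offline batch: states and the dataset (behavior) actions.
states = torch.randn(32, state_dim)
dataset_actions = torch.rand(32, action_dim) * 2 - 1

bc_alpha = 2.5  # illustrative trade-off weight between Q-maximization and BC

pred_actions = actor(states)
q_values = critic(torch.cat([states, pred_actions], dim=-1))

# Maximize Q, but penalize deviation from the dataset actions so the policy
# stays close to the behavior distribution instead of exploiting errors in Q.
actor_loss = -q_values.mean() + bc_alpha * ((pred_actions - dataset_actions) ** 2).mean()

actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()
```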
Explicit vs. implicit policy extraction
First ran into this term while going through the Batch Online RL paper.
Explicit policy extraction: train the policy directly against the Q-function, e.g. a Q-maximization loss plus a behavior-regularization term (like the sketch above). This approach has the advantage of explicitly learning from signals from the Q-function, while still making the policy stay close to the behavior dataset.
Implicit policy extraction:
- You simply do multi-trajectory sampling: the Q-function isn’t used to train the policy directly; instead you sample multiple candidates and use it only to choose among them (see the sketch after the quote below)
“While implicit policy extraction loses potentially useful signals from the Q-function for the policy, it has the advantage of disentangling the value function and policy training, which provides more stable learning”.
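A sketch of what the sampling-based implicit route can look like: the policy is trained purely by behavior cloning (no Q gradients), and the Q-function only ranks sampled candidates at inference time. Sampling single-step actions here (rather than full trajectories), plus all the shapes and the noise scale, are illustrative assumptions:

```python
import torch
import torch.nn as nn

state_dim, action_dim, num_samples = 8, 2, 16

# Behavior-cloned policy (trained separately with plain BC; no Q gradients).
bc_policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                          nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))

def act(state: torch.Tensor) -> torch.Tensor:
    """Sample candidates from the BC policy; use Q only to pick the best one."""
    with torch.no_grad():
        mean = bc_policy(state)                                          # (action_dim,)
        candidates = mean + 0.1 * torch.randn(num_samples, action_dim)   # N noisy samples
        q = critic(torch.cat([state.expand(num_samples, -1), candidates], dim=-1))
        return candidates[q.squeeze(-1).argmax()]

action = act(torch.randn(state_dim))
```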