Policy Extraction

This term is used in the Online Batch RL paper.

Explicit policy extraction: This approach has the advantage that the policy learns directly from Q-function signals while still being kept close to the behavior dataset.
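
A minimal sketch of what an explicit extraction objective could look like, assuming a behavior-regularized form: maximize the Q-value of the policy's action and penalize the squared distance to the dataset action. The function name `explicit_extraction_loss`, the weight `alpha`, and the toy Q-function `toy_q` are hypothetical and not taken from the paper.

```python
import numpy as np

def explicit_extraction_loss(q_fn, policy_actions, states, dataset_actions, alpha=1.0):
    """Assumed objective: -mean Q(s, pi(s)) + alpha * mean ||pi(s) - a_data||^2."""
    # Q-function signal: the policy is pushed toward high-value actions.
    q_term = -np.mean(q_fn(states, policy_actions))
    # Behavior term: the policy is pulled toward actions seen in the dataset.
    bc_term = np.mean(np.sum((policy_actions - dataset_actions) ** 2, axis=-1))
    return q_term + alpha * bc_term

def toy_q(states, actions):
    # Hypothetical Q-function that peaks at action = 1 (states are ignored).
    return -np.sum((actions - 1.0) ** 2, axis=-1)

states = np.zeros((4, 1))
dataset_actions = np.zeros((4, 1))  # the behavior data always took action 0

for a in (0.0, 0.5, 1.0):
    actions = np.full((4, 1), a)
    print(a, explicit_extraction_loss(toy_q, actions, states, dataset_actions))
```

In this toy setup the loss is lowest at a = 0.5, between the Q-function's preferred action (1) and the dataset action (0). That is the trade-off the note describes: use the Q-function signal, but stay close to the data.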

Implicit policy extraction:

  • You simply do multi-trajectory sampling

“While implicit policy extraction loses potentially useful signals from the Q-function for the policy, it has the advantage of disentangling the value function and policy training, which provides more stable learning”.
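
One possible reading of the sampling idea above, sketched under assumptions: draw several candidate trajectories from a separately trained behavior policy, score each with the value function, and keep the best one. No gradient from the Q-function reaches the policy, which matches the "disentangling" in the quote. The name `select_best_candidate` and the toy scoring function are hypothetical, and the paper's actual procedure may differ.

```python
import numpy as np

def select_best_candidate(candidates, value_fn):
    """Score each sampled candidate with value_fn and return the highest-scoring one."""
    scores = np.array([value_fn(c) for c in candidates])
    best = int(np.argmax(scores))
    return candidates[best], scores

# Hypothetical candidates: per-step reward estimates for three sampled trajectories.
candidates = [
    np.array([0.0, 1.0]),
    np.array([2.0, 2.0]),
    np.array([1.0, 0.0]),
]

# Assumption: the value of a trajectory is the sum of its per-step estimates.
best, scores = select_best_candidate(candidates, value_fn=np.sum)
print(best, scores)
```

The sampler and the scorer are trained independently here; the value function only ranks samples at decision time.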