Policy Extraction
This term is used in the Online Batch RL paper.
Explicit policy extraction: This approach has the advantage of training the policy directly on signals from the Q-function while still keeping it close to the behavior dataset.
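To make the trade-off concrete, here is a minimal sketch of explicit extraction on a toy 1-D problem. Everything in it is an illustrative assumption rather than the paper's setup: the critic `q_value` is a hypothetical fixed function that prefers actions near 2.0, the policy is a single state-independent number `theta`, and the "stay close to the dataset" term is a squared-error behavior-cloning penalty weighted by `alpha` (in the spirit of Q-maximization plus a BC regularizer).

```python
def q_value(action):
    """Hypothetical learned critic (assumption): prefers actions near 2.0."""
    return -(action - 2.0) ** 2


def q_grad(action):
    """Analytic dQ/da for the toy critic above."""
    return -2.0 * (action - 2.0)


def explicit_extraction_step(theta, dataset_actions, alpha, lr):
    """One gradient step on L(theta) = -Q(theta) + alpha * mean((theta - a)^2)."""
    n = len(dataset_actions)
    # Gradient of the behavior-cloning penalty: keeps theta near the data.
    bc_grad = sum(2.0 * (theta - a) for a in dataset_actions) / n
    # The policy gets a gradient signal directly from the Q-function.
    grad = -q_grad(theta) + alpha * bc_grad
    return theta - lr * grad


dataset_actions = [0.8, 1.0, 1.2]  # behavior data clusters around 1.0
theta = 0.0                        # state-independent policy output
for _ in range(500):
    theta = explicit_extraction_step(theta, dataset_actions, alpha=1.0, lr=0.05)

print(theta, q_value(theta))
```

With these toy numbers the policy settles between the critic's preferred action (2.0) and the dataset mean (1.0); raising `alpha` pulls it toward the data, lowering it pulls it toward the Q-optimum.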
Implicit policy extraction:
- You simply do multi-trajectory sampling
“While implicit policy extraction loses potentially useful signals from the Q-function for the policy, it has the advantage of disentangling the value function and policy training, which provides more stable learning.”
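One possible reading of the sampling idea, sketched under stated assumptions: draw several candidates from a policy trained only by behavior cloning, score them with the Q-function, and act on the best one. The Q-function is used only for ranking, so no gradient flows from it into the policy. The names `q_value`, `behavior_policy_sample`, and `implicit_extraction_action` are hypothetical, the candidates here are single actions rather than full trajectories, and this is not necessarily the paper's exact procedure.

```python
import random


def q_value(action):
    """Hypothetical learned critic (assumption): prefers actions near 2.0."""
    return -(action - 2.0) ** 2


def behavior_policy_sample(rng):
    """Hypothetical behavior-cloned policy: a Gaussian fit to the dataset."""
    return rng.gauss(1.0, 0.2)


def implicit_extraction_action(rng, num_samples):
    """Sample candidates from the BC policy and keep the one with the highest Q."""
    candidates = [behavior_policy_sample(rng) for _ in range(num_samples)]
    # Q only ranks the candidates; the policy itself is never updated with it.
    return max(candidates, key=q_value)


rng = random.Random(0)
action = implicit_extraction_action(rng, num_samples=16)
print(action, q_value(action))
```

Because every candidate comes from the behavior-cloned policy, the chosen action stays within the data distribution by construction; increasing `num_samples` trades more compute for a higher-valued pick.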