Off-Policy Learning
Off-Policy Learning is the idea of evaluating a target policy $\pi$ while following a behavior policy $b$.
Target Policy vs. Behavior Policy
The target policy is the policy being learned about, while the behavior policy is the policy used to generate behavior.
This allows us to learn about the Optimal Policy while following an exploratory policy, which helps with the exploration-exploitation trade-off. It also lets us learn about multiple policies while following a single policy.
On-Policy vs. Off-Policy
- On-Policy methods learn from direct experience: they evaluate or improve the same policy that is used to make decisions. They are typically less sample efficient, since data must come from the current policy.
- Off-Policy methods evaluate or improve a policy different from the one used to generate the data. They are typically more sample efficient, since past data can be reused.
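To make the distinction concrete, here is a minimal sketch (assumed for illustration, not from the source) contrasting the SARSA and Q-learning updates. SARSA is on-policy: its target bootstraps on the action the behavior policy actually takes next. Q-learning is off-policy: its target evaluates the greedy policy while an ε-greedy behavior policy generates the data.

```python
import numpy as np

def epsilon_greedy(Q, s, eps, rng):
    """Behavior policy: random action with probability eps, else greedy."""
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.99):
    """On-policy target: bootstraps on a2, the action the behavior policy
    actually selected in state s2."""
    Q[s, a] += alpha * (r + gamma * Q[s2, a2] - Q[s, a])

def q_learning_update(Q, s, a, r, s2, alpha=0.1, gamma=0.99):
    """Off-policy target: bootstraps on the greedy (target-policy) action
    in s2, regardless of what the behavior policy goes on to do."""
    Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
```

Both updates consume the same ε-greedy experience; only the bootstrap target differs, and that difference is exactly the on-policy/off-policy distinction. (One-step Q-learning is also a notable case where no importance-sampling correction is needed, since its target maximizes over next actions.)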
Almost all off-policy methods utilize Importance Sampling.
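In the standard notation (the symbols $\pi$ for the target policy, $b$ for the behavior policy, and $G_t$ for the return from time $t$ are assumed here, following Sutton & Barto), the importance-sampling ratio for the trajectory from step $t$ through $T-1$ is

$$
\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}
$$

The ordinary importance-sampling estimate of $v_\pi(s)$ then averages the reweighted returns over the set $\mathcal{T}(s)$ of time steps at which $s$ was visited:

$$
V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}\, G_t}{|\mathcal{T}(s)|}
$$

Weighted importance sampling normalizes by $\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}$ instead, trading a small bias for much lower variance.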
Off-policy methods
- require additional concepts and notation compared to on-policy methods, but are more powerful and general
- often have greater variance and converge more slowly, because the data comes from a different policy
- have a variety of additional uses in applications
- can often be applied to learn from data generated by a conventional non-learning controller, or from a human expert (see the sketch after this list)
- are also seen by some as key to learning multi-step predictive models of the world's dynamics
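As a sketch of the last two bullets, the following (hypothetical) Monte Carlo evaluator estimates state values for a target policy purely from episodes logged under a behavior policy, such as a fixed controller or a human expert. The episode format and the policy-probability functions are assumptions for illustration; the estimator is ordinary every-visit importance sampling.

```python
from collections import defaultdict

def off_policy_mc_evaluate(episodes, target_pi, behavior_b, gamma=0.99):
    """Estimate V under the target policy from logged behavior-policy data.

    episodes   -- list of trajectories, each a list of (state, action, reward)
    target_pi  -- target_pi(action, state): action probability under pi
    behavior_b -- behavior_b(action, state): action probability under b
                  (must be > 0 for every logged action, i.e. coverage holds)
    """
    returns_sum = defaultdict(float)  # sum of importance-weighted returns
    visit_count = defaultdict(int)    # every-visit counts (ordinary IS)
    for episode in episodes:
        G, rho = 0.0, 1.0
        # Walk backwards so that, at step t, G is the return from t and rho
        # is the product of probability ratios over steps t, t+1, ..., T-1.
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            rho *= target_pi(action, state) / behavior_b(action, state)
            returns_sum[state] += rho * G
            visit_count[state] += 1
    return {s: returns_sum[s] / visit_count[s] for s in visit_count}
```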
Examples of On-Policy Methods
Basically all Policy Gradient Methods in their vanilla form, such as REINFORCE, A2C/A3C, TRPO, and PPO.