Off-Policy Learning
Off-policy learning is the idea of evaluating a target policy π(a|s) while following a different behavior policy μ(a|s).
This allows us to learn about the optimal policy while following an exploratory policy, which helps with the trade-off between exploration and exploitation. It also lets us learn about multiple policies while following only one policy.
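For concreteness, here is a minimal sketch (not taken from the original notes) of this setup: the target policy is greedy with respect to the current action-value estimates Q, while the behavior policy is an ε-greedy version of it that keeps exploring. Q is assumed to be a 2-D numpy array indexed by state and action.

```python
import numpy as np

rng = np.random.default_rng(0)

def target_policy(Q, state):
    # Target policy pi: greedy with respect to the current Q estimates.
    return int(np.argmax(Q[state]))

def behavior_policy(Q, state, eps=0.1):
    # Behavior policy mu: epsilon-greedy, so it keeps exploring.
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))  # random exploratory action
    return int(np.argmax(Q[state]))           # greedy action
```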
On-Policy vs. Off-Policy
On-policy methods: evaluate or improve the policy that is used to make decisions, learning from direct experience; typically less sample efficient.
Off-policy methods: evaluate or improve a policy different from the one used to generate the data; typically more sample efficient, since data gathered under other policies can be reused.
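The difference shows up directly in the temporal-difference update rules. In this sketch (assuming a tabular numpy Q array), SARSA, an on-policy method, bootstraps from the action the behavior policy actually takes next, while Q-learning, an off-policy method, bootstraps from the greedy target-policy action.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy: bootstrap from a_next, the action the behavior policy actually took.
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy: bootstrap from the greedy (target-policy) action, regardless of what is executed next.
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
```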
Almost all off-policy methods utilize Importance Sampling.
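As a sketch of what importance sampling can look like in this setting (assuming episodic Monte Carlo evaluation, with pi_prob and mu_prob as hypothetical callables returning action probabilities under the target and behavior policy), each return observed under the behavior policy is re-weighted by the ratio of target-policy to behavior-policy probabilities along the trajectory:

```python
def importance_weighted_return(episode, pi_prob, mu_prob, gamma=0.99):
    # episode: list of (state, action, reward) tuples generated by the behavior policy mu.
    # pi_prob(s, a) and mu_prob(s, a) give the probability of taking a in s under pi and mu.
    rho, G = 1.0, 0.0
    for t, (s, a, r) in enumerate(episode):
        rho *= pi_prob(s, a) / mu_prob(s, a)   # importance sampling ratio
        G += (gamma ** t) * r                  # discounted return observed under mu
    return rho * G                             # ordinary importance sampling estimate of v_pi
```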
Off-policy methods
- require additional concepts and notation compared to on-policy methods, but are more powerful and general
- often have greater variance and converge more slowly, because the data is generated by a different policy
- have a variety of additional uses in applications
- can often be applied to learn from data generated by a conventional non-learning controller, or from a human expert (see the sketch after this list)
- are also seen by some as key to learning multi-step predictive models of the world’s dynamics
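As an illustration of learning from such data (a minimal sketch, assuming the logged experience is available as (s, a, r, s_next, done) tuples), an off-policy method like Q-learning can be run over a fixed batch of transitions produced by some other controller, without any further interaction with the environment:

```python
import numpy as np

def q_learning_from_log(logged_transitions, n_states, n_actions,
                        alpha=0.1, gamma=0.99, sweeps=50):
    # Batch off-policy learning from fixed (s, a, r, s_next, done) tuples
    # collected by another controller or a human expert.
    Q = np.zeros((n_states, n_actions))
    for _ in range(sweeps):
        for s, a, r, s_next, done in logged_transitions:
            bootstrap = 0.0 if done else Q[s_next].max()
            Q[s, a] += alpha * (r + gamma * bootstrap - Q[s, a])
    return Q
```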
Examples of On-Policy Methods
Basically all Policy Gradient Methods, such as REINFORCE, vanilla actor-critic, and A2C.
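For instance, the core REINFORCE update (a minimal sketch with a tabular softmax policy; names and shapes are illustrative) improves the policy using returns from trajectories sampled from that same policy, which is exactly what makes it on-policy:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
    # theta has shape (n_states, n_actions) and parameterises a softmax policy.
    # episode is a list of (state, action, reward) tuples sampled from that same policy.
    G = 0.0
    for s, a, r in reversed(episode):        # accumulate returns backwards in time
        G = r + gamma * G
        probs = softmax(theta[s])
        grad_log_pi = -probs                 # gradient of log pi(a|s) w.r.t. theta[s]
        grad_log_pi[a] += 1.0
        theta[s] += alpha * G * grad_log_pi  # stochastic policy-gradient ascent step
    return theta
```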