Off-Policy Learning

Off-Policy Learning is the idea of evaluating or improving a target policy π while following a different behavior policy b.

Target Policy vs. Behavior Policy

The target policy π is the policy being learned about, while the behavior policy b is the policy used to generate behavior.
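
In symbols (a clarifying addition using standard Sutton & Barto notation, not taken from the source): off-policy evaluation estimates the value of π from episodes generated by b, which requires coverage, i.e. b must sometimes take every action that π might take.

$$
v_\pi(s) \doteq \mathbb{E}_\pi\left[G_t \mid S_t = s\right], \quad \text{estimated from episodes drawn from } b, \text{ where } \pi(a \mid s) > 0 \implies b(a \mid s) > 0.
$$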

Off-policy learning allows us to learn about the optimal policy while following an exploratory policy, which helps with the exploration-exploitation trade-off. It also lets us learn about multiple policies while following only one policy.

On-Policy vs. Off-Policy

On-policy methods learn from direct experience: they evaluate or improve the very policy that is used to make decisions, which makes them less sample efficient. Off-policy methods evaluate or improve a policy different from the one used to generate the data; because they can reuse that data, they are more sample efficient. The difference shows up directly in the update targets, as sketched below.
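
As a concrete illustration (a minimal sketch, not from the source; the tabular Q is assumed to be a NumPy array and the function names are made up), compare the one-step targets of SARSA, which is on-policy, and Q-learning, which is off-policy:

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy target: bootstraps on a_next, the action the
    # behavior policy actually took in s_next.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy target: bootstraps on the greedy action, i.e. the
    # target policy's choice, regardless of what the behavior policy did.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```

Q-learning can therefore learn about the greedy (optimal) policy while the agent behaves, say, ε-greedily.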

Almost all off-policy methods utilize Importance Sampling.
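
To make the idea concrete (a minimal sketch, not from the source; `pi_prob` and `b_prob` are hypothetical callables returning action probabilities under π and b), ordinary importance sampling reweights a return generated under b so that it estimates the value under π:

```python
def is_weighted_return(episode, pi_prob, b_prob, gamma=0.99):
    """Weight an episode's return G by the importance-sampling ratio
    rho = prod_t pi(a_t | s_t) / b(a_t | s_t).

    episode: list of (state, action, reward) tuples generated by b.
    Averaging rho * G over many episodes starting in s gives an
    unbiased estimate of v_pi(s) (ordinary importance sampling).
    """
    rho, G, discount = 1.0, 0.0, 1.0
    for s, a, r in episode:
        # Coverage assumption: b(a|s) > 0 wherever pi(a|s) > 0.
        rho *= pi_prob(s, a) / b_prob(s, a)
        G += discount * r
        discount *= gamma
    return rho * G
```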

Off-policy methods

  • require additional concepts and notation compared to on-policy methods, but are more powerful and general
  • often have greater variance and converge more slowly, because the data comes from a different policy
  • have a variety of additional uses in applications
    • they can often be applied to learn from data generated by a conventional non-learning controller, or from a human expert (see the sketch after this list)
  • are also seen by some as key to learning multi-step predictive models of the world’s dynamics
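
For instance (a hedged sketch under assumed inputs, not from the source: `transitions` is taken to be a list of logged `(s, a, r, s_next, done)` tuples), Q-learning can be trained purely from data recorded by some other controller, with no new environment interaction:

```python
import numpy as np

def q_learning_from_logs(transitions, n_states, n_actions,
                         alpha=0.1, gamma=0.99, sweeps=100):
    # Because Q-learning is off-policy, it can learn from a fixed
    # batch of experience generated by another controller or a human.
    Q = np.zeros((n_states, n_actions))
    for _ in range(sweeps):
        for s, a, r, s_next, done in transitions:
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
    return Q
```

This is the basic idea behind batch/offline RL, although naive Q-learning on a fixed dataset can struggle when the logged data covers the state-action space poorly.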

Examples of On-Policy Methods

Basically all policy gradient methods, such as REINFORCE, A2C, TRPO, and PPO. Value-based methods like SARSA are also on-policy.

Examples of Off-Policy Methods

Q-learning and its deep variant DQN, off-policy Monte Carlo prediction with importance sampling, and off-policy actor-critic methods such as DDPG and SAC.
