Off-Policy Learning

Off-Policy Learning is the idea of evaluating or improving a target policy π while following a different behavior policy b.

Target Policy vs. Behavior Policy

The target policy π is the policy being learned about, while the behavior policy b is the policy used to generate behavior.
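
In symbols (a clarifying addition using standard Sutton & Barto notation, not taken from the source): off-policy evaluation estimates the value of π from episodes generated by b, which requires coverage, i.e. b must sometimes take every action that π might take.

$$
v_\pi(s) \doteq \mathbb{E}_\pi\left[G_t \mid S_t = s\right], \quad \text{estimated from episodes drawn from } b, \text{ where } \pi(a \mid s) > 0 \implies b(a \mid s) > 0.
$$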

Off-policy learning allows us to learn about the optimal policy while following an exploratory policy, which helps with the exploration-exploitation trade-off. It also lets us learn about multiple policies while following only one policy.

On-Policy vs. Off-Policy

On-policy methods learn from direct experience: they evaluate or improve the very policy that is used to make decisions, which makes them less sample efficient. Off-policy methods evaluate or improve a policy different from the one used to generate the data; because they can reuse that data, they are more sample efficient. The difference shows up directly in the update targets, as sketched below.
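
As a concrete illustration (a minimal sketch, not from the source; the tabular Q is assumed to be a NumPy array and the function names are made up), compare the one-step targets of SARSA, which is on-policy, and Q-learning, which is off-policy:

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy target: bootstraps on a_next, the action the
    # behavior policy actually took in s_next.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy target: bootstraps on the greedy action, i.e. the
    # target policy's choice, regardless of what the behavior policy did.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```

Q-learning can therefore learn about the greedy (optimal) policy while the agent behaves, say, ε-greedily.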

Almost all off-policy methods utilize Importance Sampling.
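
To make the idea concrete (a minimal sketch, not from the source; `pi_prob` and `b_prob` are hypothetical callables returning action probabilities under π and b), ordinary importance sampling reweights a return generated under b so that it estimates the value under π:

```python
def is_weighted_return(episode, pi_prob, b_prob, gamma=0.99):
    """Weight an episode's return G by the importance-sampling ratio
    rho = prod_t pi(a_t | s_t) / b(a_t | s_t).

    episode: list of (state, action, reward) tuples generated by b.
    Averaging rho * G over many episodes starting in s gives an
    unbiased estimate of v_pi(s) (ordinary importance sampling).
    """
    rho, G, discount = 1.0, 0.0, 1.0
    for s, a, r in episode:
        # Coverage assumption: b(a|s) > 0 wherever pi(a|s) > 0.
        rho *= pi_prob(s, a) / b_prob(s, a)
        G += discount * r
        discount *= gamma
    return rho * G
```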

Off-policy methods

  • require additional concepts and notation compared to on-policy methods, but are more powerful and general
  • often have greater variance and converge more slowly, because the data comes from a different policy
  • have a variety of additional uses in applications
    • they can often be applied to learn from data generated by a conventional non-learning controller, or from a human expert (see the sketch after this list)
  • are also seen by some as key to learning multi-step predictive models of the world’s dynamics
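
For instance (a hedged sketch under assumed inputs, not from the source: `transitions` is taken to be a list of logged `(s, a, r, s_next, done)` tuples), Q-learning can be trained purely from data recorded by some other controller, with no new environment interaction:

```python
import numpy as np

def q_learning_from_logs(transitions, n_states, n_actions,
                         alpha=0.1, gamma=0.99, sweeps=100):
    # Because Q-learning is off-policy, it can learn from a fixed
    # batch of experience generated by another controller or a human.
    Q = np.zeros((n_states, n_actions))
    for _ in range(sweeps):
        for s, a, r, s_next, done in transitions:
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
    return Q
```

This is the basic idea behind batch/offline RL, although naive Q-learning on a fixed dataset can struggle when the logged data covers the state-action space poorly.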

Examples of On-Policy Methods

Basically all policy gradient methods, such as REINFORCE, A2C, TRPO, and PPO. Value-based methods like SARSA are also on-policy.

Examples of Off-Policy Methods

Q-learning and its deep variant DQN, off-policy Monte Carlo prediction with importance sampling, and off-policy actor-critic methods such as DDPG and SAC.
