# Q-Learning

Q-Learning is the off-policy implementation of TD Control. I first went through this with Flappy Bird Q-Learning.

Even though it’s off-policy, we don’t need Importance Sampling! The update bootstraps from $Q$ at the successor state under the target policy directly, rather than reweighting returns sampled from the behaviour policy, so there is no probability ratio to correct for.

- We now consider off-policy learning of action-values $Q(s,a)$
- Next action is chosen using behaviour policy $A_{t+1} \sim \mu(\cdot \mid S_{t+1})$
- But we consider alternative successor action $A' \sim \pi(\cdot \mid S_{t+1})$
- And update $Q(S_t, A_t)$ towards the value of the alternative action: $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma Q(S_{t+1}, A') - Q(S_t, A_t) \right)$
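The update above can be sketched as a few lines of Python (a minimal illustration, not from any library; `Q` is just a dict from `(state, action)` pairs to estimated action-values, and all names are made up):

```python
def offpolicy_td_update(Q, s, a, r, s_next, target_policy, alpha=0.1, gamma=0.99):
    """Update Q(S_t, A_t) towards R_{t+1} + gamma * Q(S_{t+1}, A'),
    where A' comes from the target policy pi, not from the behaviour
    policy mu that actually generated the trajectory."""
    a_alt = target_policy(s_next)               # A' ~ pi(.|S_{t+1})
    td_target = r + gamma * Q[(s_next, a_alt)]  # bootstrap from the alternative action
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q[(s, a)]
```

With a greedy `target_policy`, $Q(S_{t+1}, A')$ becomes $\max_{a'} Q(S_{t+1}, a')$, which is exactly the Q-Learning control update.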

Q-Learning properties (off-policy learning):

Q-Learning converges to the optimal policy - even if you’re acting suboptimally - as long as every state-action pair keeps being visited.

### Off-Policy Control with Q-Learning

- We now allow both behaviour and target policies to improve
- The target policy $\pi$ is greedy with respect to $Q(s,a)$: $\pi(S_{t+1}) = \arg\max_{a'} Q(S_{t+1}, a')$

Q-Learning Control Algorithm: $Q(S,A) \leftarrow Q(S,A) + \alpha \left( R + \gamma \max_{a'} Q(S', a') - Q(S,A) \right)$
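Putting it together, here is a runnable sketch of Q-Learning control on a tiny chain MDP I made up for illustration (states 0..4, move left/right, reward +1 on reaching the terminal state 4): the behaviour policy is $\epsilon$-greedy while the target is greedy via the max.

```python
import random

N_STATES, GOAL = 5, 4
ACTIONS = ["left", "right"]

def step(s, a):
    """Deterministic chain dynamics: +1 reward only on reaching the goal."""
    s2 = min(s + 1, GOAL) if a == "right" else max(s - 1, 0)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

def q_learning(episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Behaviour policy mu: epsilon-greedy w.r.t. current Q (random tie-break).
            if rng.random() < eps:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda x: (Q[(s, x)], rng.random()))
            s2, r, done = step(s, a)
            # Target policy pi: greedy, i.e. max over successor actions.
            best_next = 0.0 if done else max(Q[(s2, x)] for x in ACTIONS)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q

Q = q_learning()
greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)}
```

This illustrates the property above: even though the agent keeps exploring (acting suboptimally $\epsilon$ of the time), the greedy policy extracted from $Q$ is optimal - "right" in every non-terminal state.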

From Pieter Abbeel’s Foundations of Deep RL:

- the notation here is much clearer: the $k$ subscript shows how the value of $Q$ updates over each iteration; similarly, see Value Iteration
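That $k$-subscripted style can be sketched as Q-value iteration: each sweep computes $Q_{k+1}$ from $Q_k$ over a known model, exactly analogous to Value Iteration. The two-state MDP below is made up for illustration, and all names are assumptions:

```python
GAMMA = 0.9
ACTIONS = ["stay", "go"]
# Deterministic model: T[(s, a)] -> (next_state, reward).
T = {
    ("A", "stay"): ("A", 0.0),
    ("A", "go"):   ("B", 1.0),
    ("B", "stay"): ("B", 2.0),
    ("B", "go"):   ("A", 0.0),
}

def q_value_iteration(sweeps=200):
    Q_k = {sa: 0.0 for sa in T}  # Q_0 = 0 everywhere
    for _ in range(sweeps):
        # Q_{k+1}(s, a) = R(s, a) + gamma * max_a' Q_k(s', a')
        Q_k = {(s, a): r + GAMMA * max(Q_k[(s2, a2)] for a2 in ACTIONS)
               for (s, a), (s2, r) in T.items()}
    return Q_k
```

Here the fixed point is easy to check by hand: staying in B yields $2/(1-\gamma) = 20$, so $Q(\text{A}, \text{go}) = 1 + 0.9 \cdot 20 = 19$.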

### Related

When there are too many states to store the table $Q(s,a)$ explicitly, we can try Deep Q-Learning, which approximates $Q$ with a neural network. Also see Double Q-Learning.