Q-Learning
Q-Learning is the off-policy approach to TD Control. I first went through this with Flappy Bird Q-Learning.
Even though it's off-policy, we don't need Importance Sampling! The update target bootstraps directly from $\max_{a'} Q(S_{t+1}, a')$, so there is no expectation over the behaviour policy's action probabilities that needs correcting.
- We now consider off-policy learning of action-values $Q(s,a)$
- The next action is chosen using the behaviour policy: $A_{t+1} \sim \mu(\cdot \mid S_{t+1})$
- But we consider an alternative successor action $A' \sim \pi(\cdot \mid S_{t+1})$
- And we update $Q(S_t, A_t)$ towards the value of the alternative action (see the update rule below)
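Concretely, the Q-learning update, with step size $\alpha$ and discount factor $\gamma$:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t) \right)$$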
Q-Learning Properties (Off-policy learning)
Q-Learning control converges to the optimal action-value function, $Q(s,a) \to q_*(s,a)$, even if you're acting suboptimally (e.g. with an $\varepsilon$-greedy behaviour policy), provided every state-action pair continues to be visited and step sizes are decayed appropriately.
Off-Policy Control with Q-Learning
- We now allow both the behaviour and target policies to improve
- The target policy $\pi$ is greedy with respect to $Q(s,a)$: $\pi(S_{t+1}) = \arg\max_{a'} Q(S_{t+1}, a')$
- The behaviour policy $\mu$ is e.g. $\varepsilon$-greedy with respect to $Q(s,a)$
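Substituting the greedy target policy into the update target, the Q-learning target simplifies:

$$R_{t+1} + \gamma Q\big(S_{t+1}, \arg\max_{a'} Q(S_{t+1}, a')\big) = R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a')$$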
Q-Learning Control Algorithm
From Pieter Abbeel's Foundations of Deep RL:
- the notation here is much clearer: the subscript on $Q_k$ shows how the value of $Q$ updates over each iteration $k$; similarly, see Value Iteration
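A minimal tabular Q-learning sketch in Python, assuming a Gymnasium-style environment with discrete state and action spaces (the `FrozenLake-v1` env and all hyperparameters are just illustrative choices):

```python
import numpy as np
import gymnasium as gym  # assumed installed; any env with discrete spaces works


def q_learning(env, episodes=10_000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    rng = np.random.default_rng(0)
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # Behaviour policy: epsilon-greedy w.r.t. the current Q
            if rng.random() < epsilon:
                a = env.action_space.sample()
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, terminated, truncated, _ = env.step(a)
            # Target policy: greedy w.r.t. Q, i.e. bootstrap from max_a' Q(s', a')
            target = r if terminated else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
            done = terminated or truncated
    return Q


if __name__ == "__main__":
    env = gym.make("FrozenLake-v1")  # illustrative environment choice
    Q = q_learning(env)
    print(np.argmax(Q, axis=1))  # greedy policy extracted from the learned table
```

The behaviour policy (epsilon-greedy) picks the action actually executed, while the target that $Q$ moves towards uses the greedy max; that gap is exactly what makes it off-policy.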
Related
When there are too many states to store the table $Q(s,a)$ explicitly, we can approximate it with a function approximator instead; see Deep Q-Learning. Also see Double Q-Learning, which addresses the overestimation bias caused by the $\max$ in the update target.