Q-Learning

Q-Learning is the off-policy implementation of TD Control. I first went through this with Flappy Bird Q-Learning.

Even though it’s off-policy, we don’t need Importance Sampling! The update bootstraps directly from our own estimate of the alternative (target-policy) action’s value, rather than from returns sampled under the behaviour policy, so there is no behaviour/target probability ratio to correct for.

  • We now consider off-policy learning of action-values
  • Next action is chosen using behaviour policy
  • But we consider alternative successor action
  • And update towards the value of the alternative action (see the update rule below)
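Written out, the behaviour policy μ picks the actual next action, the alternative successor action comes from the target policy π, and we update towards the value of that alternative action:

```latex
% Next action from the behaviour policy, alternative action from the target policy
A_{t+1} \sim \mu(\cdot \mid S_t), \qquad A' \sim \pi(\cdot \mid S_t)

% Update Q(S_t, A_t) towards the value of the alternative action
Q(S_t, A_t) \leftarrow Q(S_t, A_t)
    + \alpha \bigl( R_{t+1} + \gamma\, Q(S_{t+1}, A') - Q(S_t, A_t) \bigr)
```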

Q-Learning Properties (Off-policy learning)

Q-Learning converges to the optimal action-value function, Q → q*, even if you’re acting suboptimally, provided every state-action pair keeps being visited and the step sizes decay appropriately (the usual stochastic-approximation conditions).

Off-Policy Control with Q-Learning

  • We now allow both behaviour and target policies to improve
  • The target policy is greedy with respect to Q(s, a), as shown below
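With a greedy target policy, the Q-learning target simplifies to a max over successor actions:

```latex
% Greedy target policy
\pi(S_{t+1}) = \operatorname*{argmax}_{a'} Q(S_{t+1}, a')

% The Q-learning target then simplifies:
R_{t+1} + \gamma\, Q\bigl(S_{t+1}, \operatorname*{argmax}_{a'} Q(S_{t+1}, a')\bigr)
    = R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a')

% Giving the familiar Q-learning update:
Q(S_t, A_t) \leftarrow Q(S_t, A_t)
    + \alpha \Bigl( R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t) \Bigr)
```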

Q-Learning Control Algorithm

From Pieter Abbeel’s Foundations of Deep RL:

  • the notation here is much clearer: the subscript shows how the value of Q_k updates over each iteration (similarly, see Value Iteration); a tabular sketch follows below
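Here is a minimal tabular sketch of the Q-learning control loop, assuming a classic Gym-style environment with discrete states and actions. The names (env, alpha, gamma, epsilon) and the env API are illustrative assumptions, not from the source:

```python
import random
from collections import defaultdict

def q_learning(env, num_episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behaviour policy.

    Assumes a classic Gym-style env (assumption): reset() -> state,
    step(a) -> (state, reward, done, info), discrete action_space.
    """
    Q = defaultdict(float)  # Q[(state, action)], defaults to 0.0
    actions = range(env.action_space.n)

    def greedy_action(s):
        # Target policy: greedy with respect to the current Q estimate
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # Behaviour policy: epsilon-greedy (explores, unlike the target policy)
            a = random.choice(list(actions)) if random.random() < epsilon else greedy_action(s)
            s_next, r, done, _ = env.step(a)

            # Off-policy update: bootstrap from the greedy (max) successor action,
            # regardless of which action the behaviour policy actually takes next
            td_target = r + (0.0 if done else gamma * max(Q[(s_next, a2)] for a2 in actions))
            Q[(s, a)] += alpha * (td_target - Q[(s, a)])
            s = s_next
    return Q
```

The behaviour policy is ε-greedy (it explores), while the target policy implicit in the max is greedy; that gap is exactly what makes this off-policy.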

When there are too many states to tabulate Q(s, a), we can try Deep Q-Learning, which approximates Q with a neural network. Also see Double Q-Learning, which addresses the maximisation bias introduced by the max operator.
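As a pointer, tabular Double Q-learning keeps two estimates and uses one to select the argmax action and the other to evaluate it. A minimal sketch of one update, following the standard van Hasselt (2010) formulation (the function name and arguments are illustrative):

```python
import random

def double_q_update(Q1, Q2, s, a, r, s_next, actions, alpha=0.1, gamma=0.99, done=False):
    """One tabular Double Q-learning update (van Hasselt, 2010).

    Q1 and Q2 are defaultdict(float) tables keyed by (state, action).
    Decoupling action selection from evaluation reduces the
    maximisation bias of plain Q-learning.
    """
    # Flip a coin to decide which table to update this step
    if random.random() < 0.5:
        select, evaluate = Q1, Q2
    else:
        select, evaluate = Q2, Q1
    if done:
        target = r
    else:
        a_star = max(actions, key=lambda a2: select[(s_next, a2)])  # select with one table
        target = r + gamma * evaluate[(s_next, a_star)]             # evaluate with the other
    select[(s, a)] += alpha * (target - select[(s, a)])
```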