# Q-Learning

Q-Learning is the off-policy implementation of TD Control. I first went through this with Flappy Bird Q-Learning.

Even though it’s off-policy, we don’t need Importance Sampling! The update bootstraps from $Q$ at the successor state under the target policy directly, rather than reweighting returns sampled from the behaviour policy, so there is no probability ratio to correct for.

- We now consider off-policy learning of action-values $Q(s,a)$
- Next action is chosen using behaviour policy $A_{t+1} \sim \mu(\cdot \mid S_{t+1})$
- But we consider alternative successor action $A' \sim \pi(\cdot \mid S_{t+1})$
- And update $Q(S_t, A_t)$ towards the value of the alternative action: $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma Q(S_{t+1}, A') - Q(S_t, A_t) \right)$
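The update above can be sketched as a few lines of Python (a minimal illustration, not from any library; `Q` is just a dict from `(state, action)` pairs to estimated action-values, and all names are made up):

```python
def offpolicy_td_update(Q, s, a, r, s_next, target_policy, alpha=0.1, gamma=0.99):
    """Update Q(S_t, A_t) towards R_{t+1} + gamma * Q(S_{t+1}, A'),
    where A' comes from the target policy pi, not from the behaviour
    policy mu that actually generated the trajectory."""
    a_alt = target_policy(s_next)               # A' ~ pi(.|S_{t+1})
    td_target = r + gamma * Q[(s_next, a_alt)]  # bootstrap from the alternative action
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q[(s, a)]
```

With a greedy `target_policy`, $Q(S_{t+1}, A')$ becomes $\max_{a'} Q(S_{t+1}, a')$, which is exactly the Q-Learning control update.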

Q-Learning properties (off-policy learning):

Q-Learning converges to the optimal policy - even if you’re acting suboptimally - as long as every state-action pair keeps being visited.

### Off-Policy Control with Q-Learning

- We now allow both behaviour and target policies to improve
- The target policy $\pi$ is greedy with respect to $Q(s,a)$: $\pi(S_{t+1}) = \arg\max_{a'} Q(S_{t+1}, a')$

Q-Learning Control Algorithm: $Q(S,A) \leftarrow Q(S,A) + \alpha \left( R + \gamma \max_{a'} Q(S', a') - Q(S,A) \right)$
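Putting it together, here is a runnable sketch of Q-Learning control on a tiny chain MDP I made up for illustration (states 0..4, move left/right, reward +1 on reaching the terminal state 4): the behaviour policy is $\epsilon$-greedy while the target is greedy via the max.

```python
import random

N_STATES, GOAL = 5, 4
ACTIONS = ["left", "right"]

def step(s, a):
    """Deterministic chain dynamics: +1 reward only on reaching the goal."""
    s2 = min(s + 1, GOAL) if a == "right" else max(s - 1, 0)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

def q_learning(episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Behaviour policy mu: epsilon-greedy w.r.t. current Q (random tie-break).
            if rng.random() < eps:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda x: (Q[(s, x)], rng.random()))
            s2, r, done = step(s, a)
            # Target policy pi: greedy, i.e. max over successor actions.
            best_next = 0.0 if done else max(Q[(s2, x)] for x in ACTIONS)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q

Q = q_learning()
greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)}
```

This illustrates the property above: even though the agent keeps exploring (acting suboptimally $\epsilon$ of the time), the greedy policy extracted from $Q$ is optimal - "right" in every non-terminal state.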

From Pieter Abbeel’s Foundations of Deep RL:

- the notation here is much clearer: the $k$ subscript shows how the value of $Q$ updates over each iteration; similarly, see Value Iteration
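That $k$-subscripted style can be sketched as Q-value iteration: each sweep computes $Q_{k+1}$ from $Q_k$ over a known model, exactly analogous to Value Iteration. The two-state MDP below is made up for illustration, and all names are assumptions:

```python
GAMMA = 0.9
ACTIONS = ["stay", "go"]
# Deterministic model: T[(s, a)] -> (next_state, reward).
T = {
    ("A", "stay"): ("A", 0.0),
    ("A", "go"):   ("B", 1.0),
    ("B", "stay"): ("B", 2.0),
    ("B", "go"):   ("A", 0.0),
}

def q_value_iteration(sweeps=200):
    Q_k = {sa: 0.0 for sa in T}  # Q_0 = 0 everywhere
    for _ in range(sweeps):
        # Q_{k+1}(s, a) = R(s, a) + gamma * max_a' Q_k(s', a')
        Q_k = {(s, a): r + GAMMA * max(Q_k[(s2, a2)] for a2 in ACTIONS)
               for (s, a), (s2, r) in T.items()}
    return Q_k
```

Here the fixed point is easy to check by hand: staying in B yields $2/(1-\gamma) = 20$, so $Q(\text{A}, \text{go}) = 1 + 0.9 \cdot 20 = 19$.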

### Related

When there are too many states to store the table $Q(s,a)$ explicitly, we can try Deep Q-Learning, which approximates $Q$ with a neural network. Also see Double Q-Learning.