# Q-Learning

Q-Learning is the off-policy implementation of TD Control. I first went through this with Flappy Bird Q-Learning.

Even though it’s off-policy, we don’t need Importance Sampling! The update bootstraps from $Q$ at the successor state under the target policy directly, rather than reweighting returns sampled from the behaviour policy, so there is no probability ratio to correct for.

- We now consider off-policy learning of action-values $Q(s,a)$
- Next action is chosen using behaviour policy $A_{t+1} \sim \mu(\cdot \mid S_{t+1})$
- But we consider alternative successor action $A' \sim \pi(\cdot \mid S_{t+1})$
- And update $Q(S_t, A_t)$ towards the value of the alternative action: $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma Q(S_{t+1}, A') - Q(S_t, A_t) \right)$
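The update above can be sketched as a few lines of Python (a minimal illustration, not from any library; `Q` is just a dict from `(state, action)` pairs to estimated action-values, and all names are made up):

```python
def offpolicy_td_update(Q, s, a, r, s_next, target_policy, alpha=0.1, gamma=0.99):
    """Update Q(S_t, A_t) towards R_{t+1} + gamma * Q(S_{t+1}, A'),
    where A' comes from the target policy pi, not from the behaviour
    policy mu that actually generated the trajectory."""
    a_alt = target_policy(s_next)               # A' ~ pi(.|S_{t+1})
    td_target = r + gamma * Q[(s_next, a_alt)]  # bootstrap from the alternative action
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q[(s, a)]
```

With a greedy `target_policy`, $Q(S_{t+1}, A')$ becomes $\max_{a'} Q(S_{t+1}, a')$, which is exactly the Q-Learning control update.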

Q-Learning properties (off-policy learning):

Q-Learning converges to the optimal policy - even if you’re acting suboptimally - as long as every state-action pair keeps being visited.

### Off-Policy Control with Q-Learning

- We now allow both behaviour and target policies to improve
- The target policy $\pi$ is greedy with respect to $Q(s,a)$: $\pi(S_{t+1}) = \arg\max_{a'} Q(S_{t+1}, a')$

Q-Learning Control Algorithm: $Q(S,A) \leftarrow Q(S,A) + \alpha \left( R + \gamma \max_{a'} Q(S', a') - Q(S,A) \right)$
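Putting it together, here is a runnable sketch of Q-Learning control on a tiny chain MDP I made up for illustration (states 0..4, move left/right, reward +1 on reaching the terminal state 4): the behaviour policy is $\epsilon$-greedy while the target is greedy via the max.

```python
import random

N_STATES, GOAL = 5, 4
ACTIONS = ["left", "right"]

def step(s, a):
    """Deterministic chain dynamics: +1 reward only on reaching the goal."""
    s2 = min(s + 1, GOAL) if a == "right" else max(s - 1, 0)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

def q_learning(episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Behaviour policy mu: epsilon-greedy w.r.t. current Q (random tie-break).
            if rng.random() < eps:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda x: (Q[(s, x)], rng.random()))
            s2, r, done = step(s, a)
            # Target policy pi: greedy, i.e. max over successor actions.
            best_next = 0.0 if done else max(Q[(s2, x)] for x in ACTIONS)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q

Q = q_learning()
greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)}
```

This illustrates the property above: even though the agent keeps exploring (acting suboptimally $\epsilon$ of the time), the greedy policy extracted from $Q$ is optimal - "right" in every non-terminal state.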

From Pieter Abbeel’s Foundations of Deep RL:

- the notation here is much clearer: the $k$ subscript shows how the value of $Q$ updates over each iteration; similarly, see Value Iteration
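That $k$-subscripted style can be sketched as Q-value iteration: each sweep computes $Q_{k+1}$ from $Q_k$ over a known model, exactly analogous to Value Iteration. The two-state MDP below is made up for illustration, and all names are assumptions:

```python
GAMMA = 0.9
ACTIONS = ["stay", "go"]
# Deterministic model: T[(s, a)] -> (next_state, reward).
T = {
    ("A", "stay"): ("A", 0.0),
    ("A", "go"):   ("B", 1.0),
    ("B", "stay"): ("B", 2.0),
    ("B", "go"):   ("A", 0.0),
}

def q_value_iteration(sweeps=200):
    Q_k = {sa: 0.0 for sa in T}  # Q_0 = 0 everywhere
    for _ in range(sweeps):
        # Q_{k+1}(s, a) = R(s, a) + gamma * max_a' Q_k(s', a')
        Q_k = {(s, a): r + GAMMA * max(Q_k[(s2, a2)] for a2 in ACTIONS)
               for (s, a), (s2, r) in T.items()}
    return Q_k
```

Here the fixed point is easy to check by hand: staying in B yields $2/(1-\gamma) = 20$, so $Q(\text{A}, \text{go}) = 1 + 0.9 \cdot 20 = 19$.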

### Related

When there are too many states to store the table $Q(s,a)$ explicitly, we can try Deep Q-Learning, which approximates $Q$ with a neural network. Also see Double Q-Learning.