Sarsa

Sarsa is an on-policy TD control algorithm.

The idea is very similar to Monte-Carlo Control: replace the policy evaluation step with TD-Learning, and keep the same epsilon-greedy policy improvement step.
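
To make the "replace policy evaluation with TD-Learning" part concrete, the update can be written as below. The symbols are the standard ones and are an assumption here, since these notes do not define them: step size $\alpha$, discount factor $\gamma$, and $(S, A, R, S', A')$ for the current state, action, reward, next state, and next action.

$$Q(S, A) \leftarrow Q(S, A) + \alpha \left( R + \gamma Q(S', A') - Q(S, A) \right)$$

The tuple $(S, A, R, S', A')$ is where the name Sarsa comes from. $A'$ is picked by the same epsilon-greedy policy that is being improved, which is what makes the method on-policy.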

Pseudocode
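
A minimal Python sketch of tabular Sarsa, meant as an illustration rather than the original pseudocode. The corridor environment `chain_step`, the helper `epsilon_greedy`, and all hyperparameter values are hypothetical choices made for this example.

```python
import random
from collections import defaultdict


def epsilon_greedy(Q, state, n_actions, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    values = [Q[(state, a)] for a in range(n_actions)]
    best = max(values)
    # Break ties randomly so the initial all-zero estimates do not bias the agent.
    return rng.choice([a for a, v in enumerate(values) if v == best])


def chain_step(state, action, n_states=5):
    """Hypothetical toy environment: a corridor of n_states cells.

    Action 1 moves right, action 0 moves left. Reaching the last cell ends
    the episode with reward +1; every other step has reward -0.01.
    """
    nxt = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    done = nxt == n_states - 1
    return nxt, (1.0 if done else -0.01), done


def sarsa(step_fn, n_actions=2, episodes=500, alpha=0.1, gamma=0.9,
          epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)  # Q[(state, action)], initialised to 0
    for _ in range(episodes):
        s = 0  # every episode starts in the leftmost cell
        a = epsilon_greedy(Q, s, n_actions, epsilon, rng)
        done = False
        while not done:
            s2, r, done = step_fn(s, a)
            a2 = epsilon_greedy(Q, s2, n_actions, epsilon, rng)
            # On-policy TD target: bootstrap from the action actually chosen next.
            target = r if done else r + gamma * Q[(s2, a2)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q


Q = sarsa(chain_step)
# Greedy action in each non-terminal cell (1 means "move right").
greedy = [max(range(2), key=lambda a: Q[(s, a)]) for s in range(4)]
print(greedy)
```

Note that the next action `a2` is chosen before the update and then actually executed on the following step. Using `max` over the next state's actions in the target instead would turn this into Q-learning, which is off-policy.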

Theorem: Convergence of Sarsa
Sarsa converges to the optimal action-value function under the following conditions:
- [[Greedy in the Limit with Infinite Exploration|GLIE]] sequence of policies
- Robbins-Monro sequence of step-sizes $\alpha_t$
 
$$\sum_{t=1}^\infty \alpha_t = \infty$$
$$\sum_{t=1}^\infty \alpha_t^2 < \infty$$
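
As a quick sanity check, one schedule often used to illustrate the two step-size conditions is $\alpha_t = 1/t$ (an assumed example, not something these notes prescribe):

$$\sum_{t=1}^\infty \frac{1}{t} = \infty, \qquad \sum_{t=1}^\infty \frac{1}{t^2} = \frac{\pi^2}{6} < \infty$$

By contrast, a constant step size such as $\alpha_t = 0.1$ satisfies the first condition but fails the second, because $\sum_{t=1}^\infty 0.1^2 = \infty$. On the exploration side, epsilon greedy with $\epsilon_k = 1/k$ on episode $k$ is the usual example of a GLIE sequence.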
 
 
In practice, we usually don't worry about these conditions; Sarsa tends to work anyway, even with a constant step size.

Sarsa($\lambda$)

Okay, I don’t understand this, just like I don’t quite understand TD($\lambda$).