SARSA

SARSA is the on-policy implementation of TD Control. The off-policy counterpart is Q-Learning.

The idea is very similar to what we do with Monte-Carlo Control: replace the policy evaluation step with TD-Learning, and keep the same policy improvement method (epsilon-greedy).

$Q^{π} (s_{t}, a_{t}) \leftarrow Q^{π} (s_{t}, a_{t}) + α \left( r_{t} + γ Q^{π} (s_{t + 1}, a_{t + 1}) - Q^{π} (s_{t}, a_{t}) \right)$

Pseudocode
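
A minimal tabular sketch of the algorithm in Python. The `Corridor` toy environment, the helper names, and the hyperparameter values are all hypothetical, chosen only for illustration. Each step moves `Q[(s, a)]` toward the TD target by a step size `alpha`, and the next action `a2` is sampled from the same epsilon-greedy policy that is being learned, which is what makes it on-policy.

```python
import random
from collections import defaultdict

class Corridor:
    """Hypothetical toy environment: states 0..n-1, start at 0, reward 1 on reaching n-1."""
    def __init__(self, n=5):
        self.n = n

    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):
        # a == 0 moves left (blocked at state 0), a == 1 moves right.
        self.s = max(0, self.s - 1) if a == 0 else self.s + 1
        done = self.s == self.n - 1
        return self.s, (1.0 if done else 0.0), done

def epsilon_greedy(Q, s, n_actions, eps, rng):
    """Random action with probability eps, otherwise a greedy action (ties broken at random)."""
    if rng.random() < eps:
        return rng.randrange(n_actions)
    best = max(Q[(s, a)] for a in range(n_actions))
    return rng.choice([a for a in range(n_actions) if Q[(s, a)] == best])

def sarsa(env, n_actions=2, episodes=500, alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)  # Q[(state, action)], initialised to 0
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, n_actions, eps, rng)
        done = False
        while not done:
            s2, r, done = env.step(a)
            # On-policy: the next action comes from the same epsilon-greedy policy.
            a2 = epsilon_greedy(Q, s2, n_actions, eps, rng)
            # No bootstrapping from a terminal state.
            target = r if done else r + gamma * Q[(s2, a2)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q

Q = sarsa(Corridor(5))
```

The tuple used in each update is $(s, a, r, s', a')$, which is where the name SARSA comes from.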

Convergence of Sarsa Theorem

Sarsa converges to the optimal action-value function, $Q (s, a) \rightarrow Q^{*} (s, a)$, under the following conditions:

GLIE (Greedy in the Limit with Infinite Exploration) sequence of policies

Robbins-Monro sequence of step-sizes $α_{t}$

$\sum_{t = 1}^{\infty} α_{t} = \infty$

$\sum_{t = 1}^{\infty} α_{t}^{2} < \infty$
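
As a rough numerical illustration of the two conditions, the sketch below uses the schedules $α_{t} = 1/t$ and $ε_{k} = 1/k$. Both are assumed example choices (other schedules also qualify), and the function names are hypothetical. The partial sums of $α_{t}$ keep growing, while the partial sums of $α_{t}^{2}$ stay bounded.

```python
import math

def alpha(t):
    """Step size alpha_t = 1/t, an assumed example of a Robbins-Monro sequence."""
    return 1.0 / t

def epsilon(k):
    """Exploration rate eps_k = 1/k for episode k, a common GLIE choice for epsilon-greedy."""
    return 1.0 / k

def partial_sums(n):
    """Return (sum of alpha_t, sum of alpha_t^2) for t = 1..n."""
    s1 = sum(alpha(t) for t in range(1, n + 1))
    s2 = sum(alpha(t) ** 2 for t in range(1, n + 1))
    return s1, s2

for n in (10, 1_000, 100_000):
    s1, s2 = partial_sums(n)
    # s1 grows roughly like ln(n); s2 stays below pi^2 / 6.
    print(n, round(s1, 3), round(s2, 5))
```

A constant step size, which is what most implementations use, fails the second condition because the sum of squares diverges.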

In practice, we don’t worry about these conditions.

How does SARSA actually converge?

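If you start from a very bad policy, SARSA is just going to learn $Q^{π} (s, a)$ , not $Q^{*} (s, a)$ , so how are you going to get a good policy?

The answer lies in the Policy Improvement image above, i.e. Generalized Policy Iteration: every update moves $Q$ toward $Q^{π}$ for the current policy, and the policy is in turn epsilon-greedy with respect to the current $Q$ , so the policy improves as $Q$ changes. Evaluation and improvement are interleaved at every step, so SARSA never has to finish learning $Q^{π}$ for the initial bad policy.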

Sarsa ( $λ$ )

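There is also a Sarsa ( $λ$ ) version, analogous to TD-Lambda: it keeps an eligibility trace for each state-action pair, so a single TD error updates recently visited pairs as well as the current one.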
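
A minimal sketch of one backward-view Sarsa ( $λ$ ) update with accumulating eligibility traces, assuming tabular `Q` and `E` dictionaries keyed by `(state, action)`. The function name `sarsa_lambda_step` and the default hyperparameters are hypothetical.

```python
from collections import defaultdict

def sarsa_lambda_step(Q, E, s, a, r, s2, a2, done, alpha=0.1, gamma=0.9, lam=0.8):
    """Apply one backward-view Sarsa(lambda) update in place and return the TD error."""
    target = r if done else r + gamma * Q[(s2, a2)]
    delta = target - Q[(s, a)]
    E[(s, a)] += 1.0  # accumulating trace for the pair just visited
    for key in list(E):
        # Every traced pair shares the TD error, weighted by its trace.
        Q[key] += alpha * delta * E[key]
        E[key] *= gamma * lam  # traces decay after each step
    return delta

# Usage: a two-step episode. E should be reset at the start of every episode.
Q, E = defaultdict(float), defaultdict(float)
sarsa_lambda_step(Q, E, s=0, a=1, r=0.0, s2=1, a2=1, done=False)
sarsa_lambda_step(Q, E, s=1, a=1, r=1.0, s2=2, a2=0, done=True)
```

With `lam=0` the trace of every earlier pair decays to zero immediately, so only the current pair is updated and this reduces to the one-step SARSA update.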

🛠️ Steven Gong
