Temporal-Difference Learning

Intuition: don’t wait until the end of the episode to update. Wait only until the next time step.

TD learns from incomplete episodes, by Bootstrapping
TD updates a guess towards a guess

Because of this, unlike Monte-Carlo Learning:

TD can learn before knowing the final outcome
TD can learn without the final outcome

TD exploits the MDP structure.

TD Policy Evaluation

Goal: learn $v_{π}$ online from experience under policy $π$

Simplest TD learning algorithm, TD(0): update the value $V(S_{t})$ towards the TD target $R_{t + 1} + γV(S_{t + 1})$:

$V(S_{t}) \leftarrow V(S_{t}) + α δ_{t}$

where $δ_{t}$ is the TD error, the difference between the TD target and the current estimate:

$δ_{t} = R_{t + 1} + γV(S_{t + 1}) - V(S_{t})$
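
To make the TD(0) update concrete, here is a minimal tabular sketch. The environment is an assumption added for illustration (a 5-state random walk with terminals at both ends and a reward of 1 for exiting on the right), and the names `td0_update` and `td0_random_walk` are hypothetical helpers, not something from the notes.

```python
import random


def td0_update(V, s, r, s_next, alpha, gamma, terminal=False):
    """Apply one TD(0) update to V[s] in place and return the TD error."""
    v_next = 0.0 if terminal else V[s_next]  # terminal states have value 0
    delta = r + gamma * v_next - V[s]        # TD error: target minus estimate
    V[s] += alpha * delta                    # move V[s] towards the TD target
    return delta


def td0_random_walk(episodes=2000, alpha=0.05, gamma=1.0, seed=0):
    """Evaluate the uniform-random policy on an assumed 5-state random walk.

    States 1..5 are non-terminal; 0 and 6 are terminal.
    Reward is 1 when stepping into state 6, otherwise 0.
    """
    rng = random.Random(seed)
    V = [0.0] * 7
    for _ in range(episodes):
        s = 3  # every episode starts in the middle
        while s not in (0, 6):
            s_next = s + rng.choice((-1, 1))
            r = 1.0 if s_next == 6 else 0.0
            # The update happens after every step, not at the end of the episode.
            td0_update(V, s, r, s_next, alpha, gamma, terminal=s_next in (0, 6))
            s = s_next
    return V
```

For example, `td0_update({'A': 0.0, 'B': 1.0}, 'A', 0.0, 'B', alpha=0.5, gamma=1.0)` has a TD target of 1.0, so it moves `V['A']` halfway there, to 0.5. In the random walk the true values are $s/6$ for states 1 to 5, and the estimates should settle near them up to noise from the constant step size.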

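Notice the similarity to, and difference from, Monte-Carlo Learning. We just replace the actual return $G_{t}$ with the estimate $R_{t + 1} + γV(S_{t + 1})$, a sampled version of the Bellman Expectation Backup. Because this target bootstraps from $V(S_{t + 1})$, TD relies on the Markov property, so the problem has to be an MDP. This isn’t mandatory for Monte-Carlo.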
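
The difference between the two targets can be seen side by side in a short sketch. The function names `mc_target` and `td_target` and the example numbers are illustrative assumptions.

```python
def mc_target(rewards, gamma):
    """Monte-Carlo target: the full discounted return G_t.

    `rewards` holds R_{t+1}, ..., R_T, so the episode must already be finished.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def td_target(reward, v_next, gamma):
    """TD(0) target: one real reward plus the current estimate V(S_{t+1})."""
    return reward + gamma * v_next


# Rewards observed after time t, with gamma = 0.5 (made-up numbers).
rewards = [0.0, 0.0, 1.0]
g_mc = mc_target(rewards, gamma=0.5)                  # needs all three rewards
g_td = td_target(rewards[0], v_next=0.4, gamma=0.5)   # needs only the first one
```

Here `g_mc` is 0.25 and is only available once the episode ends, while `g_td` is 0.2 and is available after a single step, at the price of depending on the (possibly wrong) estimate `v_next`.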

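What a TD backup looks like: from $S_{t}$, sample a single action and a single transition to $S_{t + 1}$, then back up $R_{t + 1} + γV(S_{t + 1})$ into $V(S_{t})$.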

TD( $λ$ )

Combines the best of both worlds (TD(0) and Monte-Carlo). See Bias - Variance Tradeoff.

TD($λ$) $\neq$ taking the $λ$-step return: $λ \in [0, 1]$ is a weighting parameter, not a number of steps.

Forward-view TD($λ$) looks most similar to an n-step return; however, the $λ$-return is not a single n-step return but a weighted average of all of them.

The $λ$-return $G_{t}^{λ}$ combines all n-step returns $G_{t}^{(n)}$ using weight $(1 - λ) λ^{n - 1}$:

$G_{t}^{λ} = (1 - λ) \sum_{n = 1}^{\infty} λ^{n - 1} G_{t}^{(n)}$
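
As a sketch of how the $λ$-return could be computed for a finished episode, the code below assumes the usual n-step return $G_{t}^{(n)} = R_{t + 1} + γR_{t + 2} + \dots + γ^{n - 1} R_{t + n} + γ^{n} V(S_{t + n})$, which the notes use but do not write out. For an episode ending at step $T$, every n-step return with $n \geq T - t$ equals the full return, so those weights are lumped into a single term $λ^{T - t - 1}$. The function names and the list layout are assumptions.

```python
def n_step_return(rewards, values, t, n, gamma):
    """G_t^{(n)}: up to n real rewards, then bootstrap from V(S_{t+n}).

    rewards[k] is R_{k+1}; values[k] is V(S_k).
    values has one more entry than rewards; the last one is the terminal value (0).
    """
    T = len(rewards)
    end = min(t + n, T)  # do not look past the end of the episode
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    return g + gamma ** (end - t) * values[end]


def lambda_return(rewards, values, t, gamma, lam):
    """G_t^lambda for an episodic task, as a geometric mix of n-step returns."""
    T = len(rewards)
    g_lam = 0.0
    for n in range(1, T - t):
        weight = (1 - lam) * lam ** (n - 1)
        g_lam += weight * n_step_return(rewards, values, t, n, gamma)
    # All n >= T - t give the full return; their weights sum to lam^(T-t-1).
    g_lam += lam ** (T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
    return g_lam
```

With `rewards = [0.0, 0.0, 1.0]`, `values = [0.5, 0.5, 0.5, 0.0]`, `gamma = 1.0` and `t = 0`, the three n-step returns are 0.5, 0.5 and 1.0. Setting `lam = 0.0` recovers the one-step TD target (0.5), `lam = 1.0` recovers the Monte-Carlo return (1.0), and `lam = 0.5` gives 0.25 + 0.125 + 0.25 = 0.625.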

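All n-step returns are combined into a geometrically weighted sum; the weights $(1 - λ) λ^{n - 1}$ sum to 1.

Geometric weighting is used because it is memoryless, which is what allows TD($λ$) to be computed (through the backward view) at the same cost as TD(0).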

There is a forward-view and a backward-view version.

Forward-view TD($λ$): $V (S_{t}) \leftarrow V (S_{t}) + α (G_{t}^{λ} - V (S_{t}))$

Backward-view TD($λ$): keep an Eligibility Trace $E_{t}(s)$ for every state $s$, and update every state in proportion to the TD error $δ_{t}$ and its trace: $V (s) \leftarrow V (s) + α δ_{t} E_{t} (s)$
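
The backward view can be sketched in a few lines. The notes do not spell out how the trace itself evolves, so this sketch assumes the standard accumulating trace, $E_{0}(s) = 0$ and $E_{t}(s) = γλE_{t - 1}(s) + \mathbf{1}(S_{t} = s)$. The function name and the `(s, r, s_next, done)` transition format are hypothetical.

```python
def td_lambda_episode(V, transitions, alpha, gamma, lam):
    """Run backward-view TD(lambda) over one episode, updating V in place.

    V: dict mapping state -> value estimate.
    transitions: list of (s, r, s_next, done) tuples in time order.
    Assumes accumulating eligibility traces (not stated in the notes).
    """
    E = {s: 0.0 for s in V}  # traces start at zero each episode
    for s, r, s_next, done in transitions:
        v_next = 0.0 if done else V[s_next]
        delta = r + gamma * v_next - V[s]  # the same TD error as in TD(0)
        for x in E:
            E[x] *= gamma * lam            # decay every trace
        E[s] += 1.0                        # bump the trace of the visited state
        for x in V:
            V[x] += alpha * delta * E[x]   # credit states by their trace
    return V
```

On the two-step episode `[('A', 0.0, 'B', False), ('B', 1.0, None, True)]` with `V = {'A': 0.0, 'B': 0.0}`, `alpha = 0.5` and `gamma = 1.0`, setting `lam = 0.0` updates only `B` (to 0.5), which is exactly TD(0). Setting `lam = 1.0` also credits `A` for the final reward, so both values become 0.5. Each step costs one pass over the states, with no need to store or revisit the episode.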

No one talks about TD($λ$) nowadays?

In deep RL, TD(λ) is rarely used directly, because:

Neural networks already provide generalization, so the forward-view λ-return trick doesn’t provide the same advantage as in tabular settings.

Temporal-Difference Control

🛠️ Steven Gong
