Monte-Carlo vs. Temporal Difference

David Silver talks about this bias-variance tradeoff.

MC is unbiased but has high variance between updates. TD(0) is highly biased but has low variance across updates.

I always confuse the bias-variance tradeoff

You’re going to fail the interview if you keep confusing this lol. Monte-Carlo is much less biased because it doesn’t rely on bootstrapping. Variance comes from the fact that each episode might have drastically different returns.

TD($\lambda$) tries to combine the best of both worlds.
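
Concretely, TD($\lambda$) targets the $\lambda$-return, which mixes $n$-step returns and interpolates between TD(0) at $\lambda = 0$ and MC at $\lambda = 1$:

$$G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}, \qquad G_t^{(n)} = \sum_{k=0}^{n-1} \gamma^{k} r_{t+k} + \gamma^{n} V(s_{t+n})$$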

Where does the bias come from?

It comes from bootstrapping, i.e. updating values off other estimates. TD(0) relies heavily on bootstrapping, whereas MC does not (since it does full rollouts).

Let’s write down our equation

In TD(0)-style Q-learning, we do the update as follows:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \big]$$

Because we use $r_t + \gamma \max_{a'} Q(s_{t+1}, a')$ to update $Q(s_t, a_t)$, where $Q(s_{t+1}, a')$ is an estimate of the true $Q$, it’s always going to be biased. Compare that to MC learning, where we use the return $G_t = r_t + \gamma r_{t+1} + \dots + \gamma^{T-t} r_{T}$ to update the value function:

$$V(s_t) \leftarrow V(s_t) + \alpha \big[ G_t - V(s_t) \big]$$
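
To make the contrast concrete, here’s a minimal tabular sketch of both updates (the function names, step size, and tabular setup are mine, purely for illustration):

```python
import numpy as np

# Q-learning / TD(0): bootstrap off our own estimate of the next state's value.
def td0_q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    td_target = r + gamma * np.max(Q[s_next])  # target uses the *estimated* Q -> biased
    Q[s, a] += alpha * (td_target - Q[s, a])   # but only one sampled reward -> low variance

# Monte-Carlo: wait for the episode to finish, then use the actual return.
def mc_update(V, episode, alpha=0.1, gamma=1.0):
    G = 0.0
    for s, r in reversed(episode):             # episode = [(state, reward), ...]
        G = r + gamma * G                      # sampled return -> unbiased target
        V[s] += alpha * (G - V[s])             # but a sum of many noisy rewards -> high variance
```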

How does this work in the context of Offline RL and Off-Policy?

Like, there’s an extra layer of complexity added here.

When you go off-policy, you introduce off-policy bias. This bias comes from the fact that the behavior policy $\mu$ that collected the data is different from the target policy $\pi$ you are actually learning.

And how does this show up in Offline RL?

In offline RL, this is known as distributional shift: your Q-function generalizes to unseen actions/states and bootstraps on them, which leads to compounding bias.
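
A rough sketch of where that compounding sneaks into an offline update (the `batch` format and `q_net` here are placeholders I made up, not any particular library’s API):

```python
import numpy as np

# Offline Q-learning targets over a fixed dataset (illustrative only).
# q_net(s) is assumed to return estimated Q-values for every action in state s.
def q_learning_targets(batch, q_net, gamma=0.99):
    targets = []
    for s, a, r, s_next in batch:
        q_next = q_net(s_next)            # estimates for all actions, including ones the
        best = np.max(q_next)             # dataset never contains in s_next, so any error
        targets.append(r + gamma * best)  # there gets bootstrapped in and compounds
    return targets
```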

Bias and Variance

Bias measures how far the expected estimate is from the true value:

$$\text{Bias}\big[\hat{V}(s)\big] = \mathbb{E}\big[\hat{V}(s)\big] - V^\pi(s)$$

Variance measures how much the estimates vary around their own mean, not around the true value:

$$\text{Var}\big[\hat{V}(s)\big] = \mathbb{E}\Big[\big(\hat{V}(s) - \mathbb{E}[\hat{V}(s)]\big)^2\Big]$$
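
The two trade off through the mean-squared error of the estimator; in the same notation, the standard decomposition is:

$$\mathbb{E}\Big[\big(\hat{V}(s) - V^\pi(s)\big)^2\Big] = \underbrace{\Big(\mathbb{E}\big[\hat{V}(s)\big] - V^\pi(s)\Big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\Big[\big(\hat{V}(s) - \mathbb{E}[\hat{V}(s)]\big)^2\Big]}_{\text{variance}}$$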

"We have seen this repeatedly, such as in Chapters 7 (Figure 7.2) and 9 (Figure 9.2), where some degree of bootstrapping performed much better than Monte Carlo methods on the random-walk prediction task, and in Chapter 10 where the same was seen on the Mountain Car control task (Figure 10.4). Many other problems show much faster learning with bootstrapping (e.g., see Figure 12.14)." (Sutton & Barto, Reinforcement Learning: An Introduction)

Me wanting to visualize this

Here’s my basic thought experiment to better internalize this. Imagine we have an environment with only 8 states ($s_1, \dots, s_8$). Certain states have much higher reward variance than other states. No discount factor. After visiting each state, we get a reward of 10. So in theory, these should be the value functions:

  • …

However, when we step through the environment, the reward we get can be quite noisy. Sometimes we get 10, but sometimes we actually get 2, and sometimes 30. Say the reward coming out of $s_5$ is the noisiest.

Let’s say for a given episode rollout, the reward after $s_5$ comes out as 30 instead of 10. Then every return $G_t$ for $t \le 5$ lands 20 above the true value $V(s_t)$ (assuming the other rewards come out at 10).

MC: the first-visit update

$$V(s_t) \leftarrow V(s_t) + \alpha \big[ G_t - V(s_t) \big]$$

pushes that extra 20 into the estimates of every state from $s_1$ up to $s_5$, all in one episode.

But if we do TD:

$$V(s_5) \leftarrow V(s_5) + \alpha \big[ r_5 + V(s_6) - V(s_5) \big]$$

That noisy update will only affect $V(s_5)$ directly.

So really, the noise stays within $V(s_5)$, and it doesn’t propagate as fast to the other estimates; it only leaks backwards one bootstrapped step per episode, damped by $\alpha$. It’s really just smoothed out.
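
Here’s a quick simulation of the thought experiment (the chain layout, the noise model on $s_5$, the step size, and the episode counts are all assumptions I picked just to make the effect visible):

```python
import numpy as np

rng = np.random.default_rng(0)
N, ALPHA, GAMMA = 8, 0.1, 1.0   # 8-state chain s_1..s_8, no discounting

def rollout():
    """One pass s_1 -> s_8; every step pays 10 in expectation, but the step out of s_5 is noisy."""
    rewards = np.full(N, 10.0)
    # s_5 (index 4, chosen arbitrarily): rewards 2/10/30 with probabilities that keep the mean at 10
    rewards[4] = rng.choice([2.0, 10.0, 30.0], p=[0.5, 0.3, 0.2])
    return rewards

def run(n_episodes=200):
    V_mc, V_td = np.zeros(N), np.zeros(N)
    for _ in range(n_episodes):
        r = rollout()
        # MC: every state before the noisy step sees the noise in its full return
        G = np.cumsum(r[::-1])[::-1]                     # G_t = sum of rewards from t onward
        V_mc += ALPHA * (G - V_mc)
        # TD(0): each state only sees its own one-step reward plus a bootstrapped estimate
        for i in range(N):
            v_next = V_td[i + 1] if i + 1 < N else 0.0   # terminal value is 0
            V_td[i] += ALPHA * (r[i] + GAMMA * v_next - V_td[i])
    return V_mc, V_td

# How much do the estimates jitter across independent runs?
runs = np.array([run() for _ in range(50)])              # shape (50, 2, N)
print("true values :", [10.0 * (N - i) for i in range(N)])
print("MC  est. std:", np.round(runs[:, 0].std(axis=0), 2))
print("TD  est. std:", np.round(runs[:, 1].std(axis=0), 2))
```

The last two prints measure how much each method’s estimates jitter across independent runs, which is exactly the variance difference the argument above is about.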

This reduction in variance means that even though each update is "local," the estimates stabilize faster. You don’t need as many samples to wash out the noise.

The issue with MC is that the variance is additive: every noisy reward along the trajectory adds its own variance to the return. But we don’t have that issue if the reward is just a terminal reward.
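
Roughly, if the per-step rewards are independent, the variance of the MC target grows with the remaining horizon:

$$\operatorname{Var}(G_t) = \operatorname{Var}\Big(\sum_{k \ge t} r_k\Big) = \sum_{k \ge t} \operatorname{Var}(r_k)$$

With only a terminal reward, there is a single noise term regardless of horizon.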

In a sparse-reward setting

The paper "Horizon Reduction Makes RL Scalable" made me think a lot about this.

What I don’t understand:

  • Why is TD more sample-efficient?