Monte-Carlo vs. Temporal Difference

David Silver talks about this bias-variance tradeoff.

In MC, we have low bias but high variance. In TD(0), we have high bias but low variance.

TD($\lambda$) tries to combine the best of both worlds.
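Concretely, the standard $\lambda$-return averages the n-step returns with geometrically decaying weights:

$$
G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}
$$

With $\lambda = 0$ this collapses to the one-step TD target (high bias, low variance); with $\lambda = 1$ it recovers the full Monte-Carlo return (low bias, high variance).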

Where does the bias come from? It comes from bootstrapping. TD(0) relies heavily on bootstrapping, whereas MC does not.

Let’s write down the update equations.

In TD(0) (Q-learning), the update is:

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]
$$

Because we use $r_t + \gamma \max_{a'} Q(s_{t+1}, a')$ to update $Q(s_t, a_t)$, where $Q(s_{t+1}, a')$ is only an estimate of the true Q, the target is always going to be biased. Compare that to MC learning, where we use the observed return $G_t = \sum_{k=0}^{T-t} \gamma^{k} r_{t+k}$ to update the value function.
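A minimal tabular sketch of the two updates (the dictionaries `Q` and `V`, the step size `alpha`, and the discount `gamma` are just illustrative names, not from any particular library):

```python
from collections import defaultdict

alpha, gamma = 0.1, 0.99
Q = defaultdict(float)  # action-value estimates, keyed by (state, action)
V = defaultdict(float)  # state-value estimates, keyed by state

def td0_update(s, a, r, s_next, actions):
    """TD(0) / Q-learning: the target bootstraps on our own estimate of Q(s', .)."""
    target = r + gamma * max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])  # biased target, low variance

def mc_update(episode):
    """Monte-Carlo: wait until the episode ends, then use the actual return G_t."""
    G = 0.0
    for s, a, r in reversed(episode):  # episode = [(s0, a0, r0), (s1, a1, r1), ...]
        G = r + gamma * G              # the real discounted return from time t
        V[s] += alpha * (G - V[s])     # unbiased target, high variance
```

The TD target contains `Q`, our own current estimate, so any error in `Q` leaks into every update; the MC target contains only rewards that were actually observed.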

How does this work in the context of Offline RL and Off-Policy learning?

There is an extra layer of complexity here.

When you go off-policy, you introduce off-policy bias. This bias comes from the fact that the behavior policy $\mu$ that generated the data is different from the target policy $\pi$ you are actually learning.

How does this work in the context of Offline RL?

In offline RL, this is known as distributional shift: your Q-function generalizes to unseen actions/states and bootstraps on them, which leads to compounding bias.
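Here is a toy illustration of that compounding (everything is made up for illustration: a single state, two actions, an offline dataset that only ever contains action 0, and a Q-function that, say through network generalization, is optimistic about the never-seen action 1):

```python
from collections import defaultdict

gamma, alpha = 0.99, 0.5
Q = defaultdict(float)

# Offline dataset: every logged transition takes action 0 and gets reward 0.
dataset = [("s", 0, 0.0, "s")] * 50

# Pretend generalization gave the unseen action 1 an optimistic value.
# In offline RL there is no new data to ever correct this entry.
Q[("s", 1)] = 1.0

for s, a, r, s_next in dataset:
    # The max in the target bootstraps on the out-of-distribution action 1...
    target = r + gamma * max(Q[(s_next, 0)], Q[(s_next, 1)])
    # ...so the error leaks into the in-distribution value as well.
    Q[(s, a)] += alpha * (target - Q[(s, a)])

print(Q[("s", 0)])  # ≈ 0.99, even though every observed reward was 0
```

In online RL the agent would eventually try action 1 and correct the estimate; offline, the wrong value just keeps being backed up.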

Bias and Variance

Bias measures how far the expected estimate is from the true value:

$$
\text{Bias}(\hat{V}) = \mathbb{E}[\hat{V}] - V^{\pi}
$$

Variance measures how much the estimates vary around their own mean, not around the true value:

$$
\text{Var}(\hat{V}) = \mathbb{E}\left[\left(\hat{V} - \mathbb{E}[\hat{V}]\right)^{2}\right]
$$
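A tiny numerical illustration of the two definitions (the numbers are invented; we pretend the true value is known so we can measure bias directly, and treat each sample as the estimate produced by one run):

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 1.0

# Stand-ins for value estimates produced by many independent runs.
mc_estimates = true_value + rng.normal(0.0, 1.0, size=10_000)        # unbiased, noisy
td_estimates = 0.7 * true_value + rng.normal(0.0, 0.2, size=10_000)  # biased, stable

for name, est in [("MC", mc_estimates), ("TD(0)", td_estimates)]:
    bias = est.mean() - true_value                # E[V_hat] - V
    variance = ((est - est.mean()) ** 2).mean()   # E[(V_hat - E[V_hat])^2]
    print(f"{name}: bias ≈ {bias:+.3f}, variance ≈ {variance:.3f}")
```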

Explained in more detail

Longer returns (higher $n$) include more real rewards and less bootstrapping, so:

  • They reduce the variance that comes from using noisy bootstrapped estimates of $Q(s_{t+n}, a_{t+n})$.

  • You rely more on observed rewards (the n-step target is written out below).
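A reasonable way to write that target (with a Q-learning-style bootstrap at the tail, matching the earlier update) is:

$$
G_t^{(n)} = r_t + \gamma r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^{n} \max_{a'} Q(s_{t+n}, a')
$$

The first $n$ terms are observed rewards; only the final term is a bootstrapped estimate, and it is discounted by $\gamma^{n}$.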

❌ Bias

The bias comes not from overfitting to a single episode, but from off-policy mismatch and bootstrapping from stale or wrong values. Specifically:

  • If your policy changes, the tail value $Q(s_{t+n}, a_{t+n})$ may no longer align with the data used.

  • If you compute n-step returns naïvely, you’re assuming that the trajectory you’re using is “on-policy”, which it often is not.

So the bias isn’t from overfitting to the episode’s rewards (those are real); it’s that you’re backing up from a part of the trajectory that might not reflect your current policy.
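To make that last point concrete, here is a sketch of a naive n-step target computed from a replayed (off-policy) trajectory; the names `replay_slice` and `actions` are illustrative, and no off-policy correction is applied, which is exactly the hidden assumption described above:

```python
from collections import defaultdict

gamma, n = 0.99, 3
Q = defaultdict(float)

def naive_nstep_target(replay_slice, actions):
    """replay_slice = [(s_t, a_t, r_t), ..., (s_{t+n-1}, a_{t+n-1}, r_{t+n-1}), (s_{t+n},)]

    The intermediate actions a_{t+1}, ..., a_{t+n-1} were chosen by an old
    behavior policy. Summing their rewards as if they were our own implicitly
    assumes the current policy would have acted the same way.
    """
    target = 0.0
    for k, (s, a, r) in enumerate(replay_slice[:n]):
        target += (gamma ** k) * r  # real rewards, but generated by a stale policy
    s_tail = replay_slice[n][0]     # bootstrap at the n-th step
    target += (gamma ** n) * max(Q[(s_tail, a_next)] for a_next in actions)
    return target
```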