Monte-Carlo vs. Temporal Difference

David Silver talks about this bias-variance tradeoff.

In MC, we have low bias, but high variance. In TD(0), we have high bias, but low variance.

I always confuse the bias-variance tradeoff.

You’re going to fail the interview if you keep confusing this lol. Monte-Carlo is much less biased because it doesn’t rely on bootstrapping. Variance comes from the fact that each episode might have drastically different returns.
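A toy sketch of where the variance comes from (a made-up environment with noisy rewards and a hypothetical `v_estimate`, not any real benchmark): the MC target is the full discounted return of an episode, which accumulates noise from every step, while the TD(0) target only depends on one reward plus the current (possibly wrong) value estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.99

def rollout_return(horizon=50):
    # Full Monte-Carlo return: a sum of many noisy rewards -> high variance.
    rewards = rng.normal(loc=1.0, scale=1.0, size=horizon)
    return sum(gamma**t * r for t, r in enumerate(rewards))

def td_target(v_estimate=95.0):
    # TD(0) target: one noisy reward + a fixed value estimate
    # -> low variance, but biased whenever v_estimate is off.
    reward = rng.normal(loc=1.0, scale=1.0)
    return reward + gamma * v_estimate

mc_targets = np.array([rollout_return() for _ in range(10_000)])
td_targets = np.array([td_target() for _ in range(10_000)])

print("MC target variance :", mc_targets.var())   # large (~30 here)
print("TD target variance :", td_targets.var())   # small (~1 here)
```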

TD(λ) tries to combine the best of both worlds.
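As a rough sketch of that idea (a forward-view λ-return helper of my own, `lambda_return`, not from any library): the λ-return mixes n-step returns, so λ = 0 recovers the TD(0) target and λ = 1 recovers the Monte-Carlo return.

```python
import numpy as np

def lambda_return(rewards, values, gamma=0.99, lam=0.9):
    """Forward-view lambda-return for one episode.

    rewards[t] is r_{t+1}; values[t] is the current estimate V(s_{t+1}),
    with values[-1] = 0.0 for the terminal state.
    lam = 0 -> TD(0) target, lam = 1 -> Monte-Carlo return.
    """
    T = len(rewards)
    G = np.zeros(T)
    next_G = 0.0
    # Recursive form:
    # G_t^lambda = r_{t+1} + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1}^lambda)
    for t in reversed(range(T)):
        next_G = rewards[t] + gamma * ((1 - lam) * values[t] + lam * next_G)
        G[t] = next_G
    return G
```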

Where does the bias come from?

It comes from bootstrapping, i.e. updating value estimates using other estimates. TD(0) relies heavily on bootstrapping, whereas MC does not (since it uses full rollouts).

Let’s write down our equations.

In TD(0) (Q-learning), we do the update as follows:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$$

Because we use $\max_{a'} Q(s_{t+1}, a')$ to update $Q(s_t, a_t)$, where $Q$ is an estimate of the true $Q^*$, it’s always going to be biased. Compare that to MC learning, where we use the return $G_t$ to update the value function:

$$V(s_t) \leftarrow V(s_t) + \alpha \left[ G_t - V(s_t) \right]$$
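A minimal tabular sketch of both updates (dict-based `Q`/`V` tables and the helper names are mine, just for illustration):

```python
from collections import defaultdict

alpha, gamma = 0.1, 0.99
Q = defaultdict(float)   # Q[(state, action)]
V = defaultdict(float)   # V[state]

def q_learning_update(s, a, r, s_next, actions):
    # TD(0) / Q-learning: the target bootstraps on our own estimate of Q,
    # so any error in Q(s_next, .) leaks into the update (bias).
    td_target = r + gamma * max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

def mc_update(episode):
    # Monte-Carlo: the target is the actual observed return G_t,
    # unbiased but noisy because the whole trajectory is random.
    # episode = [(s_0, a_0, r_1), (s_1, a_1, r_2), ...]
    G = 0.0
    for s, a, r in reversed(episode):
        G = r + gamma * G
        V[s] += alpha * (G - V[s])
```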

How does this work in the context of Offline RL and Off-Policy?

There’s an extra layer of complexity added there.

When you go off-policy, you introduce off-policy bias. This bias comes from the fact that the behavior policy $\mu$ (the one that generated the data) is different from the target policy $\pi$ you are actually learning.

And in the context of Offline RL?

In offline RL, this is known as distributional shift: your Q-function generalizes to unseen actions/states and bootstraps on them, which leads to compounding bias.
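A toy illustration of that (all numbers made up, not a real offline-RL algorithm): if the dataset only ever contains a couple of actions, the bootstrapped max still ranges over all actions, including ones the Q-function has never seen data for, and it tends to latch onto the most over-estimated of those.

```python
import numpy as np

rng = np.random.default_rng(1)

n_actions = 10
dataset_actions = {0, 1}           # behavior policy only ever took these
true_q = np.zeros(n_actions)       # pretend the true value of every action is 0

# In-distribution actions have small estimation error; out-of-distribution
# actions can be wildly over-estimated because nothing constrains them.
q_hat = np.where(
    np.isin(np.arange(n_actions), list(dataset_actions)),
    true_q + rng.normal(0, 0.1, n_actions),
    true_q + rng.normal(0, 2.0, n_actions),
)

# The bootstrapped target takes a max over ALL actions, so it picks up the
# most over-estimated unseen action -> compounding bias over repeated updates.
print("max over dataset actions :", max(q_hat[a] for a in dataset_actions))
print("max over all actions     :", q_hat.max())
```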

Bias and Variance

Bias measures how far the expected estimate is from the true value:

$$\text{Bias}[\hat{\theta}] = \mathbb{E}[\hat{\theta}] - \theta$$

Variance measures how much the estimates vary around their own mean, not around the true value:

$$\text{Var}[\hat{\theta}] = \mathbb{E}\!\left[ \left( \hat{\theta} - \mathbb{E}[\hat{\theta}] \right)^2 \right]$$
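A quick numerical sanity check of the two definitions (toy Gaussian samples with a deliberately offset mean, all numbers hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
true_value = 5.0

# Many independent estimates of the same quantity
# (think: many runs of an MC or TD estimator).
estimates = true_value + rng.normal(0.5, 2.0, size=100_000)  # offset 0.5 -> bias

bias = estimates.mean() - true_value                     # E[theta_hat] - theta
variance = ((estimates - estimates.mean()) ** 2).mean()  # spread around own mean

print(f"bias ~ {bias:.3f}, variance ~ {variance:.3f}")   # roughly 0.5 and 4.0
```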