Monte-Carlo vs. Temporal Difference

David Silver talks about this bias-variance tradeoff.

MC is unbiased but has high variance between updates. TD(0) is highly biased but has low variance across updates.

I always confuse the bias-variance tradeoff

You're going to fail the interview if you keep confusing this lol. Monte-Carlo is much less biased because it doesn't rely on bootstrapping. Variance comes from the fact that each episode might have drastically different returns.

TD($\lambda$) tries to combine the best of both worlds.

Where does the bias come from?

It comes from bootstrapping, i.e. updating values off of other estimates. TD(0) relies heavily on bootstrapping, whereas MC does not (since it does full rollouts).

Let’s write down our equation

In TD(0) (Q-Learning), we do the update as follows:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[\, r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \,\big]$$

Because we use the bootstrapped target $r_t + \gamma \max_{a'} Q(s_{t+1}, a')$ to update $Q(s_t, a_t)$, where $Q$ is only an estimate of the true $Q^*$, it's always going to be biased. Compare that to MC learning, where we use the full return $G_t$ to update the value function:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[\, G_t - Q(s_t, a_t) \,\big]$$
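Here is a minimal tabular sketch of the two targets side by side (my own helper names; `Q` is just a NumPy array indexed by state and action):

```python
import numpy as np

def td0_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # TD(0) / Q-learning: the target bootstraps off our *current estimate* of Q,
    # which is exactly where the bias comes from.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def mc_update(Q, s, a, G, alpha=0.1):
    # Monte-Carlo: the target is the full return G actually observed after (s, a).
    # Unbiased, but G swings a lot from episode to episode.
    Q[s, a] += alpha * (G - Q[s, a])

Q = np.zeros((5, 2))                      # 5 states, 2 actions
td0_update(Q, s=0, a=1, r=1.0, s_next=2)  # one-step backup off an estimate
mc_update(Q, s=0, a=1, G=7.3)             # backup off an observed return
```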

How does this work in the context of Offline RL and Off-Policy?

There's an extra layer of complexity added here.

When you go off-policy, you introduce off-policy bias. This bias comes from the fact that the behavior policy $\pi_b$ that generated the data is different from the target policy $\pi$, the one you are actually learning.

How does this work in the context of Offline RL?

In offline RL, this is known as distributional shift: your Q-function generalizes to unseen actions/states and bootstraps on them, which leads to compounding bias.
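A toy sketch of where that bias creeps in (mine, with made-up numbers): the dataset only covers the actions the behavior policy actually took, but the bootstrapped target still takes a max over all actions, including ones the Q-function has only ever guessed about:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, alpha = 4, 3, 0.99, 0.5

# Stand-in for a function approximator's arbitrary guesses about actions
# it has never seen any data for.
Q = rng.normal(0.0, 1.0, size=(n_states, n_actions))

# Offline dataset: the behavior policy only ever took action 0.
dataset = [(0, 0, 1.0, 1), (1, 0, 1.0, 2), (2, 0, 1.0, 3)]

for s, a, r, s_next in dataset:
    # The max also ranges over actions 1 and 2, for which we have no data at all.
    # If Q happens to overestimate one of them, that error gets bootstrapped
    # into Q[s, a] and compounds along the chain.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```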

Bias and Variance

Bias measures how far the expected estimate is from the true value:

$$\operatorname{Bias}\big[\hat{V}(s)\big] = \mathbb{E}\big[\hat{V}(s)\big] - V^{\pi}(s)$$

Variance measures how much the estimates vary around their own mean, not around the true value:

$$\operatorname{Var}\big[\hat{V}(s)\big] = \mathbb{E}\Big[\big(\hat{V}(s) - \mathbb{E}[\hat{V}(s)]\big)^{2}\Big]$$

"We have seen this repeatedly, such as in Chapters 7 (Figure 7.2) and 9 (Figure 9.2), where some degree of bootstrapping performed much better than Monte Carlo methods on the random-walk prediction task, and in Chapter 10 where the same was seen on the Mountain-Car control task (Figure 10.4). Many other problems show much faster learning with bootstrapping (e.g., see Figure 12.14)."

Me wanting to visualize this

Above is my basic thought experiment to better internalize this. Imagine we have an environment with only 8 states ($s_1, \dots, s_8$) that we pass through in order. Certain states have much higher reward variance than others. No discount factor. After visiting each state, we get a reward of 10 (in expectation). So in theory, these should be the value functions:

  • …

However, when we step through the environment, the reward we get can be quite noisy. Sometimes we get 10, but sometimes we actually get 2, sometimes 30. One state in particular, call it $s_k$, is the noisiest.

Let's say that for a given episode rollout we get a reward of 10 in every state except $s_k$, where we happen to draw 30. Then $G_1 = 7 \cdot 10 + 30 = 100$, and every return $G_t$ for states visited before $s_k$ is similarly $20$ above its true value.

MC: the first time we visit each state $s_t$ in that episode, we update

$$V(s_t) \leftarrow V(s_t) + \alpha \big[\, G_t - V(s_t) \,\big],$$

so the noisy $+20$ gets pushed into the estimate of every state up to and including $s_k$.

But if we do TD(0), the update is

$$V(s_t) \leftarrow V(s_t) + \alpha \big[\, r_t + V(s_{t+1}) - V(s_t) \,\big]$$

(no discount, per the setup).

That noisy update will only directly affect $V(s_k)$, because the noisy reward only shows up in the one-step target for $s_k$.

So really, the noise stays within $V(s_k)$, and it doesn't propagate as fast to the other estimates. It's really just smoothed out.

This reduction in variance means that even though each update is "local," the estimates stabilize faster. You don't need as many samples to wash out the noise.
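Here is a minimal simulation of the thought experiment above (my own toy version; which state is noisy, the noise values, and the learning rate are all made up), comparing how much the MC and TD(0) estimates of the first state's value keep fluctuating:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, alpha, n_episodes = 8, 0.1, 2000
noisy_state = 4                       # assumption: which state has the noisy reward

V_mc = np.zeros(n_states)             # Monte-Carlo estimates
V_td = np.zeros(n_states)             # TD(0) estimates
hist_mc, hist_td = [], []

for _ in range(n_episodes):
    # Walk the 8-state chain in order. Every reward is 10, except the noisy
    # state, whose reward is 2 or 30 (mean still 10).
    rewards = np.full(n_states, 10.0)
    rewards[noisy_state] = rng.choice([2.0, 30.0], p=[5 / 7, 2 / 7])

    # MC: the undiscounted return G_t of every state visited before the noisy
    # one contains the noise.
    returns = np.cumsum(rewards[::-1])[::-1]
    V_mc += alpha * (returns - V_mc)

    # TD(0): only the noisy state's own update sees the noisy reward directly;
    # every other state bootstraps off already-smoothed estimates.
    for t in range(n_states):
        v_next = V_td[t + 1] if t + 1 < n_states else 0.0
        V_td[t] += alpha * (rewards[t] + v_next - V_td[t])

    hist_mc.append(V_mc[0])
    hist_td.append(V_td[0])

# Both hover around the true value of 80, but the MC estimate wobbles much more.
half = n_episodes // 2
print("std of MC estimate of the first state's value:", float(np.std(hist_mc[half:])))
print("std of TD estimate of the first state's value:", float(np.std(hist_td[half:])))
```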

The issue with MC is that the variance of the return is additive across timesteps. But we don't have that issue if the only reward is a terminal reward.
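To see why (assuming the per-step rewards are independent, which the toy chain above satisfies): with no discount the MC target is a sum of per-step rewards, so its variance is the sum of the per-step variances and grows with the horizon, whereas with only a terminal reward there is just the one term:

$$G_t = \sum_{k=t}^{T} r_k \quad\Longrightarrow\quad \operatorname{Var}(G_t) = \sum_{k=t}^{T} \operatorname{Var}(r_k)$$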

In a sparse reward setting

The paper Horizon Reduction Makes RL Scalable made me think a lot about this.

What I don’t understand:

  • Why is TD more sample efficient?

Bias-Variance Tradeoff for ML Models

In ML, when we talk about bias and variance, we talk about the actual model.

If we train 100 different instantiations of a model on 100 different datasets drawn from the same data distribution, do all the models predict similarly or not?
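A quick numerical version of exactly that experiment (my own sketch; polynomial regression is just a stand-in for "a model", and the degrees are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dataset(n=40):
    # A fresh dataset from the same distribution every time: y = sin(x) + noise.
    x = rng.uniform(-3, 3, n)
    return x, np.sin(x) + rng.normal(0, 0.3, n)

def bias_and_variance(degree, x0=1.5, n_models=100):
    # Train 100 instantiations of the "same model" on 100 different datasets
    # and look at how their predictions at a fixed input x0 spread out.
    preds = np.array([
        np.polyval(np.polyfit(*sample_dataset(), degree), x0)
        for _ in range(n_models)
    ])
    bias_sq = (preds.mean() - np.sin(x0)) ** 2   # mean prediction vs. the truth
    variance = preds.var()                       # spread around their own mean
    return bias_sq, variance

for degree in (1, 3, 9):
    b2, var = bias_and_variance(degree)
    print(f"degree {degree}: bias^2 = {b2:.3f}, variance = {var:.3f}")
```

The high-degree fit chases each particular dataset's noise, so the 100 models disagree a lot at the query point (high variance); the linear fit barely changes between datasets but is systematically off (high bias).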

Where does high variance come from?

  • High variance comes from instability: small changes in the training data lead to big changes in the learned model. That instability usually comes from some combination of too much model capacity relative to the amount of data, too little regularization, and a small or noisy training set.

"I'd look at a train/validation split and learning curves. If train error is high and close to validation, that's high bias/underfitting → I'd add capacity, improve features, or reduce regularization. If train error is low but validation is much higher, that's high variance/overfitting → I'd add regularization, simplify the model, or get more data. I validate changes with cross-validation and keep a fixed test set untouched."
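And a hedged sketch of that answer in code (scikit-learn on made-up data; the alphas are arbitrary), where a low train error paired with a much higher validation error signals variance, and high-and-close errors signal bias:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, 80)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

for alpha in (1e-4, 1.0, 100.0):   # little -> lots of regularization
    model = make_pipeline(PolynomialFeatures(degree=10), Ridge(alpha=alpha))
    model.fit(X_tr, y_tr)
    train_mse = mean_squared_error(y_tr, model.predict(X_tr))
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    # train << val           -> high variance / overfitting: regularize, simplify, or get more data
    # train ~ val, both high -> high bias / underfitting: add capacity or reduce regularization
    print(f"alpha={alpha:g}: train MSE={train_mse:.3f}, val MSE={val_mse:.3f}")
```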