Monte-Carlo vs. Temporal Difference
David Silver talks about this bias-variance tradeoff.
MC is unbiased but has high variance across updates. TD(0) is biased but has low variance across updates.
I always confuse the bias-variance tradeoff
You're going to fail the interview if you keep confusing this lol. Monte-Carlo is much less biased because it doesn't rely on bootstrapping. Its variance comes from the fact that each episode might have a drastically different return.
TD($\lambda$) tries to combine the best of both worlds.
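For reference, the $\lambda$-return that TD($\lambda$) uses as its target is a geometrically weighted mixture of n-step returns:

$$G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{\,n-1} G_t^{(n)}, \qquad G_t^{(n)} = r_t + \gamma r_{t+1} + \cdots + \gamma^{\,n-1} r_{t+n-1} + \gamma^{\,n} V(s_{t+n})$$

Setting $\lambda = 0$ recovers TD(0) and $\lambda = 1$ recovers the Monte-Carlo return, so $\lambda$ interpolates between the two bias-variance extremes.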
Where does the bias come from?
It comes from bootstrapping, i.e. updating values off estimates. TD(0) relies heavily on bootstrapping, whereas MC does not (since it does full rollouts).
Let's write down the equations.
In TD(0) Q-learning, we do the update as follows:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[\,r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\,\right]$$

Because we use $r_t + \gamma \max_{a'} Q(s_{t+1}, a')$ to update $Q(s_t, a_t)$, where $Q(s_{t+1}, a')$ is an estimate of the true Q, the target is always going to be biased. Compare that to MC learning, where we use the full return $G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots$ to update the value function:

$$V(s_t) \leftarrow V(s_t) + \alpha\left[\,G_t - V(s_t)\,\right]$$
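To make the difference concrete, here's a minimal tabular sketch of the two update rules (my own toy code, shown for state-value prediction rather than Q-learning to keep it small; the episode and step size are made up):

```python
import numpy as np

def mc_update(V, episode, alpha=0.1, gamma=1.0):
    """Monte-Carlo: push each visited state toward its full observed return G_t."""
    G = 0.0
    for s, r in reversed(episode):      # walk backwards so G accumulates the future return
        G = r + gamma * G
        V[s] += alpha * (G - V[s])      # target G is unbiased but high-variance
    return V

def td0_update(V, episode, alpha=0.1, gamma=1.0):
    """TD(0): push each state toward r + gamma * V(s'), a bootstrapped (biased) target."""
    for (s, r), nxt in zip(episode, episode[1:] + [None]):
        v_next = 0.0 if nxt is None else V[nxt[0]]   # terminal next-state value is 0
        V[s] += alpha * (r + gamma * v_next - V[s])
    return V

# episode = [(state, reward received on leaving that state), ...]
episode = [(0, 10.0), (1, 10.0), (2, 30.0)]
print(mc_update(np.zeros(3), episode))   # every estimate's target includes the noisy final reward
print(td0_update(np.zeros(3), episode))  # only V[2]'s target sees it directly
```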
How does this work in the context of Offline RL and Off-Policy?
There's an extra layer of complexity added there.
When you go off-policy, you introduce off-policy bias. This bias comes from the fact that the behavior policy $\mu$ that generated the data is different from the target policy $\pi$ you are actually learning.
How does this work in the context of Offline RL?
In offline RL, this is known as distributional shift: your Q-function generalizes to unseen actions/states and bootstraps on them, which leads to compounding bias.
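A toy illustration of that failure mode (the numbers and the single-action dataset below are made up): the Bellman target bootstraps on whichever action looks best to the current Q, even if the dataset never contains it, so estimation errors on those unseen actions never get corrected and are copied into other states' targets.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 4, 0.99

# Stand-in for a learned, imperfect Q-function (random noise plays the role of estimation error).
Q = rng.normal(loc=0.0, scale=1.0, size=(n_states, n_actions))

# Offline dataset: the behavior policy only ever took action 0.
dataset = [(s, 0, 1.0, (s + 1) % n_states) for s in range(n_states)]

for s, a, r, s_next in dataset:
    greedy_a = int(Q[s_next].argmax())            # action the bootstrap target uses
    seen = greedy_a == 0                          # was that action ever in the dataset?
    target = r + gamma * Q[s_next, greedy_a]
    print(f"s={s}: bootstraps on action {greedy_a} (in dataset: {seen}), target={target:+.2f}")
```

Most of the time the argmax lands on an action with no data behind it, which is exactly the compounding-bias story above.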
Bias and Variance
Bias measures how far the expected estimate is from the true value:

$$\operatorname{Bias}(\hat{V}) = \mathbb{E}[\hat{V}] - V^{\pi}$$
Variance measures how much the estimates vary around their own mean, not around the true value:

$$\operatorname{Var}(\hat{V}) = \mathbb{E}\!\left[\left(\hat{V} - \mathbb{E}[\hat{V}]\right)^2\right]$$
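A quick empirical check of the two definitions (numbers made up): treat each "estimate" as the average of a few noisy sampled returns, then measure where its mean sits and how much it spreads.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 80.0                       # the quantity being estimated

def estimate(n_samples):
    # One estimate = mean of n noisy sampled returns around the true value.
    returns = true_value + rng.normal(0.0, 20.0, size=n_samples)
    return returns.mean()

estimates = np.array([estimate(5) for _ in range(10_000)])

bias = estimates.mean() - true_value                      # E[V_hat] - V
variance = np.mean((estimates - estimates.mean()) ** 2)   # E[(V_hat - E[V_hat])^2]
print(f"bias ~ {bias:.2f}, variance ~ {variance:.1f}")    # bias ~ 0, variance ~ 20^2 / 5 = 80
```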
"We have seen this repeatedly, such as in Chapters 7 (Figure 7.2) and 9 (Figure 9.2), where some degree of bootstrapping performed much better than Monte Carlo methods on the random-walk prediction task, and in Chapter 10 where the same was seen on the Mountain Car control task (Figure 10.4). Many other problems show much faster learning with bootstrapping (e.g., see Figure 12.14)." (Sutton & Barto, Reinforcement Learning: An Introduction)
Me wanting to visualize this
Here's my basic thought experiment to better internalize this. Imagine we have an environment with only 8 states ($s_1, \ldots, s_8$), visited in order, where certain states have much higher reward variance than others. There's no discount factor. After visiting each state, we get a reward of 10. So in theory, the value functions should be:
- $V(s_1) = 80,\; V(s_2) = 70,\; \ldots,\; V(s_8) = 10$
However, when we step through the environment, the reward we get can be quite noisy. Sometimes we get 10, but sometimes we actually get 2, and sometimes 30. $s_8$ is the most noisy.
Let's say for a given episode rollout, we get the following rewards: $r_1 = r_2 = \cdots = r_7 = 10$, and then we get $r_8 = 30$. So $G_1 = 100$ and $G_8 = 30$ (instead of the expected 80 and 10).
MC: The first time we visit $s_1$, we update $V(s_1) \leftarrow V(s_1) + \alpha\,[\,G_1 - V(s_1)\,]$ with the noisy return $G_1 = 100$, and the same noisy $r_8$ shows up in the return for every other state in the episode.
But if we do TD(0), each state's update only uses its own one-step reward plus the bootstrapped value of the next state; the noisy reward only enters at the final step:

$$V(s_8) \leftarrow V(s_8) + \alpha\,[\,r_8 - V(s_8)\,]$$
That noisy update will only affect $V(s_8)$.
So really, the noise stays within $V(s_8)$, and it doesn't propagate as fast to the other estimates. It's mostly smoothed out by the time bootstrapping passes it along.
This reduction in variance means that even though each update is "local," the estimates stabilize faster. You don't need as many samples to wash out the noise.
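Here's a sketch that runs the thought experiment numerically, under my assumptions above (8-state chain visited in order, reward 10 per step, only the last step's reward noisy, constant step size):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, alpha, gamma, n_episodes = 8, 0.1, 1.0, 2000

def rollout():
    # Deterministic walk s_1 -> ... -> s_8. Each step pays 10, but the reward
    # on the last step is noisy (roughly the "sometimes 2, sometimes 30" story).
    rewards = [10.0] * (n_states - 1) + [10.0 + 12.0 * rng.standard_normal()]
    return list(zip(range(n_states), rewards))

V_mc, V_td = np.zeros(n_states), np.zeros(n_states)
for _ in range(n_episodes):
    episode = rollout()

    # Monte-Carlo: every state is pushed toward the full (noisy) return.
    G = 0.0
    for s, r in reversed(episode):
        G = r + gamma * G
        V_mc[s] += alpha * (G - V_mc[s])

    # TD(0): only the last state's update sees the noisy reward directly.
    for t, (s, r) in enumerate(episode):
        v_next = V_td[s + 1] if t < n_states - 1 else 0.0
        V_td[s] += alpha * (r + gamma * v_next - V_td[s])

print("true:", [10.0 * (n_states - i) for i in range(n_states)])
print("MC  :", np.round(V_mc, 1))
print("TD  :", np.round(V_td, 1))
```

With a constant step size, every MC estimate keeps jittering (each target includes the noisy final reward), while under TD(0) the jitter stays mostly in the last state's estimate.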
The issue is that the return's variance is additive across timesteps, so it grows with the horizon. But we don't have that issue if the only reward is a terminal reward.
- see the Law of Total Variance
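Spelling that out with the law of total variance, conditioning the return on the first transition:

$$\operatorname{Var}(G_t) = \underbrace{\operatorname{Var}\!\big(\mathbb{E}[\,G_t \mid r_t, s_{t+1}\,]\big)}_{\text{variance of the one-step (TD) target}} + \underbrace{\mathbb{E}\!\big[\operatorname{Var}(G_t \mid r_t, s_{t+1})\big]}_{\text{variance of everything after } s_{t+1}}$$

Here $\mathbb{E}[\,G_t \mid r_t, s_{t+1}\,] = r_t + \gamma v_\pi(s_{t+1})$ is (the idealized) TD(0) target, so the second term is the extra variance MC pays for an unbiased target. Since $G_t = r_t + \gamma G_{t+1}$, applying the decomposition recursively shows the return's variance is a sum of per-step contributions over the horizon, which is the sense in which it is additive.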
In a sparse-reward setting
The paper "Horizon Reduction Makes RL Scalable" made me think a lot about this.
What I don't understand:
- why is TD more sample-efficient?