Monte-Carlo Learning

Monte-Carlo Learning is a Model-Free Policy Evaluation method. First-visit MC gives an unbiased estimate of $v_{π}$; every-visit MC is biased but consistent.

Intuition: Takes averages of actual returns over episodes. As more returns are observed, the average should converge to the expected value.

MC methods learn directly from episodes of experience
MC learns from complete episodes: no bootstrapping
- Caveat: This means that we can only apply it to episodic MDPs, i.e. all episodes must terminate
MC does not rely on the Markov property, so it also handles non-Markovian domains

MC Policy Evaluation

Goal: learn $v_{π}$ from episodes of experience under policy $π$: $S_{1}, A_{1}, R_{2}, ..., S_{k} \sim π$
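For reference, the return $G_{t}$ used throughout this note can be written out as below. The discount factor $γ$ and the terminal time-step $T$ are notation assumed here (the note does not define them); with $γ = 1$ the return is just the sum of rewards until the episode ends.

$$G_{t} = R_{t+1} + γ R_{t+2} + ... + γ^{T-t-1} R_{T}$$

$$v_{π} (s) = E_{π} [G_{t} ∣ S_{t} = s]$$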

Monte-Carlo policy evaluation uses the empirical mean return (the average of the returns actually observed) instead of the expected return computed from the model, as we saw in Policy Iteration.

Updates $V$ at the end of each episode
- $V$ is an estimate of the value function that uses sampled returns to approximate the expectation

First-Visit Monte-Carlo Policy Evaluation

For each state $s$ visited in episode $i$, at the first time-step $t$ that state $s$ is visited in episode $i$:
- Increment counter $N (s) \leftarrow N (s) + 1$
- Increment total return $G (s) \leftarrow G (s) + G_{i, t}$
- Value is estimated by mean return $V (s) = G (s) / N (s)$
- By Law of Large Numbers, $V (s) \to v_{π} (s)$ as $N (s) \to \infty$
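The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the episode format (a list of `(state, reward)` pairs, where `reward` is the reward received after leaving that state), the function name `first_visit_mc`, and the discount `gamma` are all assumptions made for the example.

```python
from collections import defaultdict


def first_visit_mc(episodes, gamma=1.0):
    """Estimate V(s) by averaging the return from the first visit to s in each episode.

    episodes: list of episodes; each episode is a list of (state, reward) pairs,
    where reward is the reward received after leaving that state (assumed format).
    """
    N = defaultdict(int)          # visit counter N(s)
    G_total = defaultdict(float)  # accumulated return G(s)

    for episode in episodes:
        # Compute the return G_t for every time-step by scanning backwards.
        returns = [0.0] * len(episode)
        g = 0.0
        for t in range(len(episode) - 1, -1, -1):
            _, reward = episode[t]
            g = reward + gamma * g
            returns[t] = g

        seen = set()
        for t, (state, _) in enumerate(episode):
            if state in seen:
                continue  # only the first visit in this episode counts
            seen.add(state)
            N[state] += 1
            G_total[state] += returns[t]

    # Value is the mean return: V(s) = G(s) / N(s)
    return {s: G_total[s] / N[s] for s in N}


# Two hand-made episodes (hypothetical data, undiscounted).
episodes = [
    [("A", 1.0), ("B", 0.0), ("A", 2.0)],
    [("B", 0.0), ("A", 1.0)],
]
V = first_visit_mc(episodes)
print(V)  # V["A"] == 2.0, V["B"] == 1.5
```

In the first episode the returns are 3, 2, 2, so the first visit to A contributes 3 and the second visit to A is ignored.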

Properties:

The first-visit MC estimator $V (s)$ is an unbiased estimator of the true $v_{π} (s) = E_{π} [G_{t} ∣ S_{t} = s]$

Every-Visit Monte-Carlo Policy Evaluation

Same as first-visit, but the update is done at every time-step $t$ that state $s$ is visited in episode $i$.

Properties:

The every-visit MC estimator $V$ is a biased but consistent estimator of $v_{π}$, and it often has lower MSE.
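To make the difference between the two variants concrete, here is a small sketch that runs both on a single episode in which a state is visited twice. The helper `mc_evaluate`, its `first_visit` flag, and the `(state, reward)` episode format are assumptions made for this illustration.

```python
from collections import defaultdict


def mc_evaluate(episodes, first_visit=True, gamma=1.0):
    """Monte-Carlo policy evaluation; first_visit=False gives the every-visit variant."""
    N = defaultdict(int)
    G_total = defaultdict(float)

    for episode in episodes:
        # Returns G_t for each time-step, computed from the end of the episode.
        returns = []
        g = 0.0
        for _, reward in reversed(episode):
            g = reward + gamma * g
            returns.append(g)
        returns.reverse()

        seen = set()
        for t, (state, _) in enumerate(episode):
            if first_visit and state in seen:
                continue  # first-visit: skip repeat visits within the episode
            seen.add(state)
            N[state] += 1
            G_total[state] += returns[t]

    return {s: G_total[s] / N[s] for s in N}


# State A is visited at t=0 (return 3) and again at t=2 (return 2).
episode = [("A", 1.0), ("B", 0.0), ("A", 2.0)]
v_first = mc_evaluate([episode], first_visit=True)
v_every = mc_evaluate([episode], first_visit=False)
print(v_first["A"], v_every["A"])  # 3.0 2.5
```

Every-visit averages both returns observed from A. Those two returns come from the same episode and are therefore correlated, which is where the bias comes from.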

Incremental Monte-Carlo (MC)

Monte-Carlo methods can be implemented incrementally by using the Incremental Mean.

Update $V (s)$ incrementally after episode $S_{1}, A_{1}, R_{2}, ..., S_{T}$. For each state $S_{t}$ with return $G_{t}$:
- $N (S_{t}) \leftarrow N (S_{t}) + 1$
- $V (S_{t}) \leftarrow V (S_{t}) + \frac{1}{N ( S _{t} )} (G_{t} - V (S_{t}))$
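As a quick sanity check, the incremental update reproduces the ordinary mean of the returns seen so far. The sketch below tracks a single state; the function name `incremental_mean_update` and the sample returns are made up for the example.

```python
def incremental_mean_update(value, count, g):
    """One incremental-mean step: N <- N + 1, V <- V + (G - V) / N."""
    count += 1
    value += (g - value) / count
    return value, count


# Hypothetical returns observed for one state over four episodes.
returns = [3.0, 1.0, 2.0, 6.0]

v, n = 0.0, 0
for g in returns:
    v, n = incremental_mean_update(v, n, g)

batch_mean = sum(returns) / len(returns)
print(v, batch_mean)  # 3.0 3.0
```

The benefit is that only $V$ and $N$ need to be stored per state, rather than the full list of returns.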

We can replace $\frac{1}{N ( S _{t} )}$ with $α$ to get a more general Incremental MC Policy Evaluation: $V (S_{t}) \leftarrow V (S_{t}) + α (G_{t} - V (S_{t}))$

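If we use a constant $α$, so that $α > \frac{1}{N ( S _{t} )}$ once $N (S_{t})$ is large, then recent returns get more weight and old episodes are gradually forgotten. This is useful in non-stationary problems.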
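The effect of a constant step size can be seen on a toy sequence of returns that shifts halfway through. This is only a sketch: the data, the choice `alpha=0.5`, and the helper name `constant_alpha_update` are assumptions for illustration.

```python
def constant_alpha_update(value, g, alpha):
    """Incremental MC update with a fixed step size: V <- V + alpha * (G - V)."""
    return value + alpha * (g - value)


# Hypothetical non-stationary returns: ten episodes return 0, then ten return 1.
returns = [0.0] * 10 + [1.0] * 10

v_mean, n = 0.0, 0  # running mean, step size 1/N
v_alpha = 0.0       # constant step size

for g in returns:
    n += 1
    v_mean += (g - v_mean) / n
    v_alpha = constant_alpha_update(v_alpha, g, alpha=0.5)

print(v_mean)   # about 0.5: old and new returns weighted equally
print(v_alpha)  # about 0.999: dominated by the most recent returns
```

With step size $\frac{1}{N}$ every return has equal weight, so the estimate lags behind the change. With constant $α$ the weight of a return decays geometrically with its age.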

Monte-Carlo Control

🛠️ Steven Gong
