Model-Free Policy Evaluation

Monte-Carlo Learning

Monte-Carlo Learning is a Model-Free Policy Evaluation method. it is unbiased..? (i think every visit is biased?)

Intuition: Takes averages of actual returns over episodes. As more returns are observed, the average should converge to the expected value.

  • MC methods learn directly from episodes of experience
  • MC learns from complete episodes: no bootstrapping
    • Caveat: This means that we can only apply it to episodic MDPs, i.e. all episodes must terminate
  • Handles non-markovian domains

MC Policy Evaluation

Goal: learn from episodes of experience under policy

Monte-Carlo policy evaluation uses empirical (the actual) mean return instead of expected return as we saw in Policy Iteration.

  • Updates at the end of each episode
    • is an estimate of the value function using sample of return to approximate the expectation

First-Visit Monte-Carlo Policy Evaluation

For each state visited in episode ,

  • The first time-step that state is visited in episode
    • Increment counter
    • Increment total return
    • Value is estimated by mean return
    • By Law of Large Numbers, as


  • estimator is an unbiased Estimator of true

Every-Visit Monte-Carlo Policy Evaluation

Same a first-visit, but update is done every time-step that state is visited in episode .


  • every-visit MC estimator is a biased estimator of but consistent estimator and often has better MSE.

Incremental Monte-Carlo (MC)

Monte-Carlo methods can be implemented incrementally by using the Incremental Mean.

  • Update incrementally after episode
  • For each state with return

We can replace with to get a more general Incremental MC Policy Evaluation:

  • If we set , then we attribute higher weight to newer data