Model-Free Policy Evaluation
The policy evaluation (or prediction) problem is the problem of estimating the value function for a given policy π.
I was introduced to two methods: Monte Carlo (MC) learning and Temporal-Difference (TD) learning.
These are learning methods that estimate the value function of a particular policy directly from sampled experience.
We say it is Model-Free because we don’t know beforehand how the reward system and dynamics (MDP transitions) work, so we can’t directly solve everything with dynamic programming.
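Here is a minimal first-visit Monte Carlo prediction sketch in Python. The `env.reset()` / `env.step()` interface (returning `(next_state, reward, done)`) and the `policy(state)` callable are assumptions made for illustration, not a specific library's API:

```python
from collections import defaultdict

def mc_policy_evaluation(env, policy, num_episodes=1000, gamma=0.99):
    """First-visit Monte Carlo prediction of V^pi from sampled episodes."""
    V = defaultdict(float)        # value estimate per state
    n_visits = defaultdict(int)   # number of first visits per state

    for _ in range(num_episodes):
        # Roll out one full episode by following the policy; no model of the
        # transitions or rewards is needed, only the ability to sample them.
        episode = []              # list of (state, reward) pairs
        state = env.reset()       # assumed interface
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)   # assumed interface
            episode.append((state, reward))
            state = next_state

        # Index of the first occurrence of each state in this episode.
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)

        # Walk backwards through the episode, accumulating the return G.
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = gamma * G + r
            if first_visit[s] == t:                      # first-visit update
                n_visits[s] += 1
                V[s] += (G - V[s]) / n_visits[s]         # incremental mean of returns
    return V
```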
Why a Model-Free Approach?
Why use a model-free approach when we could just model the environment as an MDP? Here are some applications (we can create accurate simulators for some of these, but they are very computationally expensive):
- Elevator
- Parallel Parking
- Ship Steering
- Helicopter
- Robocup Soccer
- Portfolio management
- Protein Folding
- Robot walking
- Game of Go
For most of these problems, either:
- MDP model is unknown, but experience can be sampled
- MDP model is known, but is too big to use, except by samples
Model-free approaches can solve these problems.
Insights: connection to Model-Free Control, q-values vs. state-values
For policy evaluation to work for action values, we must ensure continual exploration.
When we do MC learning and the policy is deterministic, we are just sampling one path: we only observe returns for the single action the policy picks in each state, so we are not exploring at all.
To fix this, we can specify that episodes start in a state–action pair, and that every pair has a nonzero probability of being selected as the start → this is called the assumption of exploring starts.
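As a rough sketch of what exploring starts looks like when estimating q-values with first-visit MC (the helpers `env.all_state_action_pairs()` and `env.set_state()` are hypothetical; they just stand in for a simulator that lets us start an episode from any state–action pair):

```python
import random
from collections import defaultdict

def mc_q_evaluation_exploring_starts(env, policy, num_episodes=1000, gamma=0.99):
    """First-visit MC estimate of Q^pi, with exploring starts so that every
    state-action pair has a nonzero probability of starting an episode."""
    Q = defaultdict(float)
    n_visits = defaultdict(int)

    for _ in range(num_episodes):
        # Exploring start: pick a random state-action pair, force that action
        # once, then follow the (possibly deterministic) policy afterwards.
        state, action = random.choice(env.all_state_action_pairs())  # hypothetical helper
        env.set_state(state)                                         # hypothetical helper
        episode = []                        # list of (state, action, reward)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state
            if not done:
                action = policy(state)

        # Index of the first occurrence of each (state, action) pair.
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)

        # Backward pass accumulating returns, updating Q at first visits.
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if first_visit[(s, a)] == t:
                n_visits[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / n_visits[(s, a)]
    return Q
```

Starting each episode from a randomly chosen pair keeps every (state, action) combination visited, even when the policy itself is deterministic.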
State-value function vs. action-value function: in Model-Free Control, we need the action-value function (q-values) rather than the state-value function for the estimates to be useful (without a model, we cannot do a one-step lookahead over state values to pick actions, but we can act greedily with respect to q-values directly). Everything here was explained using the state-value function, but it generalizes easily to q-value functions.
Instead of visiting states, we talk about visiting state–action pairs.
This is why Model-Free Control techniques talk about q-values and rarely about state-value functions.
See more on Model-Based vs. Model-Free RL.