Model-Free Policy Evaluation
The policy evaluation (or prediction) problem is the problem of estimating the value function for a given policy π.
I was introduced to two methods: Monte Carlo (MC) learning and Temporal-Difference (TD) learning.
These are learning methods that estimate the value function of a particular policy directly from sampled experience.
We say it is Model-Free because we don’t know beforehand how the reward system and dynamics (MDP transitions) work, so we can’t directly solve everything with dynamic programming.
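Here is a minimal first-visit Monte Carlo prediction sketch in Python. The `env.reset()` / `env.step()` interface (returning `(next_state, reward, done)`) and the `policy(state)` callable are assumptions made for illustration, not a specific library's API:

```python
from collections import defaultdict

def mc_policy_evaluation(env, policy, num_episodes=1000, gamma=0.99):
    """First-visit Monte Carlo prediction of V^pi from sampled episodes."""
    V = defaultdict(float)        # value estimate per state
    n_visits = defaultdict(int)   # number of first visits per state

    for _ in range(num_episodes):
        # Roll out one full episode by following the policy; no model of the
        # transitions or rewards is needed, only the ability to sample them.
        episode = []              # list of (state, reward) pairs
        state = env.reset()       # assumed interface
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)   # assumed interface
            episode.append((state, reward))
            state = next_state

        # Index of the first occurrence of each state in this episode.
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)

        # Walk backwards through the episode, accumulating the return G.
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = gamma * G + r
            if first_visit[s] == t:                      # first-visit update
                n_visits[s] += 1
                V[s] += (G - V[s]) / n_visits[s]         # incremental mean of returns
    return V
```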
Why a Model-Free Approach?
Why use a model-free approach when we could just model the environment as an MDP? Here are some applications (we can create accurate simulators for some of these, but they are very computationally expensive):
- Elevator
- Parallel Parking
- Ship Steering
- Helicopter
- Robocup Soccer
- Portfolio management
- Protein Folding
- Robot walking
- Game of Go
For most of these problems, either:
- MDP model is unknown, but experience can be sampled
- MDP model is known, but is too big to use, except by samples
Model-free approaches can solve these problems.
Insights: connection to Model-Free Control, q-values vs. state-values
For policy evaluation to work for action values, we must ensure continual exploration.
When we do MC learning and the policy is deterministic, we are just sampling one path: we only observe returns for the single action the policy picks in each state, so we are not exploring at all.
To fix this, we can specify that episodes start in a state–action pair, and that every pair has a nonzero probability of being selected as the start → this is called the assumption of exploring starts.
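As a rough sketch of what exploring starts looks like when estimating q-values with first-visit MC (the helpers `env.all_state_action_pairs()` and `env.set_state()` are hypothetical; they just stand in for a simulator that lets us start an episode from any state–action pair):

```python
import random
from collections import defaultdict

def mc_q_evaluation_exploring_starts(env, policy, num_episodes=1000, gamma=0.99):
    """First-visit MC estimate of Q^pi, with exploring starts so that every
    state-action pair has a nonzero probability of starting an episode."""
    Q = defaultdict(float)
    n_visits = defaultdict(int)

    for _ in range(num_episodes):
        # Exploring start: pick a random state-action pair, force that action
        # once, then follow the (possibly deterministic) policy afterwards.
        state, action = random.choice(env.all_state_action_pairs())  # hypothetical helper
        env.set_state(state)                                         # hypothetical helper
        episode = []                        # list of (state, action, reward)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state
            if not done:
                action = policy(state)

        # Index of the first occurrence of each (state, action) pair.
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)

        # Backward pass accumulating returns, updating Q at first visits.
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if first_visit[(s, a)] == t:
                n_visits[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / n_visits[(s, a)]
    return Q
```

Starting each episode from a randomly chosen pair keeps every (state, action) combination visited, even when the policy itself is deterministic.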
State-value function vs. action-value function: in Model-Free Control, we need the action-value function (q-values) rather than the state-value function for the estimates to be useful (without a model, we cannot do a one-step lookahead over state values to pick actions, but we can act greedily with respect to q-values directly). Everything here was explained using the state-value function, but it generalizes easily to q-value functions.
Instead of visiting states, we talk about visiting state–action pairs.
This is why Model-Free Control techniques talk about q-values and rarely about state-value functions.
See more on Model-Based vs. Model-Free RL.