We do Model-Free Policy Evaluation first, and then Model-Free Control.
Model-Free Control
By Control, we mean finding an optimal policy for the agent to use.
For the control problem (finding an optimal policy), DP, TD, and Monte Carlo methods all use some variation of Generalized Policy Iteration.
What is this? GPI just alternates two steps: policy evaluation (estimate the value of the current policy) and policy improvement (act greedily, or ε-greedily, with respect to that estimate).
I learned this course through the lens of David Silver's teaching, where most of it is based on bootstrapping using the Bellman Equation.
Model-free control = policy improvement on top of policy evaluation (as opposed to evaluation alone)
Topics
- Doing Model-Free Control
On-Policy vs. Off-Policy Learning
There is this conflict between Exploration and Exploitation. If we only follow a fixed policy in an unknown environment, we may never find the optimal policy, which is why we need to continually explore.
One of the simplest ways to keep exploring during policy improvement is thus ε-greedy: with probability ε pick a random action, otherwise act greedily with respect to the current action-value estimates.
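A minimal sketch of ε-greedy action selection (the names here, such as `Q` being a table of per-state action values and `rng` being a NumPy generator, are my own assumptions, not anything defined in these notes):

```python
import numpy as np

def epsilon_greedy_action(Q, state, n_actions, epsilon, rng):
    """Pick an action epsilon-greedily w.r.t. the current Q estimates."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore: uniform random action
    return int(np.argmax(Q[state]))           # exploit: greedy action
```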
Off-policy learning is really cool because you learn about a target policy (the one you care about) while following a different behaviour policy — i.e. you use data generated by other policies to evaluate or improve your own.
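The canonical concrete example (not covered above, just named here for illustration) is Q-learning: actions come from an ε-greedy behaviour policy, but the update bootstraps off the greedy target policy. A rough sketch of the update, reusing a tabular `Q` like the one above:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One off-policy TD(0) control update (Q-learning).

    The behaviour policy picked action a (e.g. epsilon-greedily), but the
    target bootstraps off max_a' Q(s', a'), i.e. the greedy target policy —
    that mismatch is what makes it off-policy.
    """
    td_target = r + gamma * np.max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])
```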
In Model-Based Control, since you already know the MDP, you don't have to worry about exploration: you can sweep the whole state space. If you instead use, for example, MC for Model-Free Control, you cannot guarantee that you have sampled the entire state-action space, even though MC is an unbiased estimator. That is why, in the control/policy-improvement step, you need to keep exploring.
Now the problem is: how can you guarantee, with this continual exploration, that you will still converge to the optimal policy? This is where GLIE (Greedy in the Limit with Infinite Exploration) comes in: every state-action pair keeps being visited infinitely often, while the policy gradually becomes greedy, e.g. ε-greedy with ε_k = 1/k on episode k.
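A rough sketch of GLIE Monte-Carlo control under those assumptions — it reuses the `epsilon_greedy_action` helper above, assumes a Gymnasium-style environment with discrete observations and actions, and uses every-visit incremental updates for brevity; none of these names come from the notes:

```python
import numpy as np
from collections import defaultdict

def glie_mc_control(env, n_episodes, gamma=1.0, seed=0):
    """GLIE Monte-Carlo control sketch.

    epsilon_k = 1/k decays to zero (greedy in the limit) while every
    state-action pair keeps a nonzero chance of being tried (infinite
    exploration) — the two conditions GLIE asks for.
    """
    rng = np.random.default_rng(seed)
    n_actions = env.action_space.n
    Q = defaultdict(lambda: np.zeros(n_actions))   # action-value estimates
    N = defaultdict(lambda: np.zeros(n_actions))   # visit counts

    for k in range(1, n_episodes + 1):
        epsilon = 1.0 / k

        # Generate one episode with the current epsilon-greedy policy.
        episode, state, done = [], env.reset()[0], False
        while not done:
            action = epsilon_greedy_action(Q, state, n_actions, epsilon, rng)
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, action, reward))
            state, done = next_state, terminated or truncated

        # Incremental every-visit MC update: Q <- Q + (1/N)(G - Q).
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            N[state][action] += 1
            Q[state][action] += (G - Q[state][action]) / N[state][action]

    return Q
```

This is exactly Generalized Policy Iteration: each episode evaluates the current ε-greedy policy a little (the MC update), and the next episode improves on it by acting ε-greedily with respect to the updated Q.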
Insights
See the insights in [[notes/Model-Free Policy Evaluation#Insights connection to Model-Free Control q-values vs state-value|Model-Free Policy Evaluation#Insights connection to Model-Free Control q-values vs state-value]].