We do Model-Free Policy Evaluation first, and then Model-Free Control.

Model-Free Control

By Control, we mean finding an optimal policy for the agent to use.

For the control problem (finding an optimal policy), DP, TD, and Monte Carlo methods all use some variation of Generalized Policy Iteration.
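
As a rough sketch (the helper names `evaluate_q` and `improve_policy` are placeholders, not from any particular library), GPI just alternates evaluating the current policy with (partially) greedy improvement:

```python
# Minimal sketch of Generalized Policy Iteration (GPI).
# evaluate_q and improve_policy are placeholders: any evaluation method
# (DP sweep, MC rollouts, TD updates) and any improvement step
# (greedy, epsilon-greedy) can be plugged in.

def generalized_policy_iteration(evaluate_q, improve_policy, policy, n_iterations=100):
    for _ in range(n_iterations):
        q = evaluate_q(policy)        # policy evaluation (full or partial)
        policy = improve_policy(q)    # policy improvement (e.g., greedy w.r.t. q)
    return policy
```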

Topics

On-Policy vs. Off-Policy Learning

There is a fundamental trade-off between Exploration and Exploitation. If we only exploit a fixed policy in an unknown environment, we may never discover the optimal actions, which is why we need to continually explore.

One of the simplest ways to do policy improvement is thus to use an Epsilon-Greedy policy, which ensures that we keep exploring.
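
A minimal sketch of epsilon-greedy action selection, assuming tabular q-values indexed as `Q[state][action]` (a hypothetical layout, not from these notes):

```python
import random

def epsilon_greedy_action(Q, state, n_actions, epsilon):
    """With probability epsilon pick a random action (explore),
    otherwise pick the action with the highest q-value (exploit)."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[state][a])
```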

Off-policy learning is really cool because you use experience generated by a different (behaviour) policy to evaluate or improve the policy you actually care about (the target policy).
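
For instance, Q-learning is the classic off-policy method: the behaviour policy that generates experience can be epsilon-greedy, while the update target uses the greedy policy. A hedged sketch of a single update, with the same assumed `Q[state][action]` tabular layout as above:

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy TD update: the target bootstraps from max over actions
    (the greedy target policy), regardless of which action the
    behaviour policy will actually take in s_next."""
    td_target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])
```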

In Model-Based Control, since you already know the MDP, you don’t have to worry about Exploration: you know the whole sample space. When you use, for example, MC for Model-Free Policy Evaluation, you cannot guarantee that you have sampled from the entire sample space, even though MC is an unbiased estimator. That is why, in the control/policy-improvement step, you need to keep exploring.

Now the problem is, how can you guarantee that, with this continual exploration, you will still converge to the Optimal Policy? This is where GLIE (Greedy in the Limit with Infinite Exploration) comes in.
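
A sketch of GLIE Monte-Carlo control, using the common epsilon_k = 1/k schedule so exploration never fully stops but vanishes in the limit; `run_episode` is a hypothetical stand-in for whatever generates episodes from the environment:

```python
from collections import defaultdict
import random

def glie_mc_control(run_episode, n_actions, n_episodes=10_000, gamma=1.0):
    """Sketch of GLIE Monte-Carlo control.
    run_episode(policy) is a placeholder that returns a list of
    (state, action, reward) tuples obtained by following `policy`."""
    Q = defaultdict(lambda: [0.0] * n_actions)
    N = defaultdict(lambda: [0] * n_actions)      # state-action visit counts

    for k in range(1, n_episodes + 1):
        epsilon = 1.0 / k                         # GLIE schedule: epsilon -> 0

        def policy(state):
            if random.random() < epsilon:
                return random.randrange(n_actions)
            return max(range(n_actions), key=lambda a: Q[state][a])

        episode = run_episode(policy)

        # Every-visit MC: update Q towards the observed returns.
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            N[state][action] += 1
            Q[state][action] += (G - Q[state][action]) / N[state][action]

    return Q
```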

Insights

The relevant insights are in [[notes/Model-Free Policy Evaluation#Insights connection to Model-Free Control q-values vs state-value|Model-Free Policy Evaluation#Insights connection to Model-Free Control q-values vs state-value]].