# Generalized Policy Iteration

This is KEY to understanding how reinforcement learning works.

Generalized Policy Iteration refers to the general idea of letting policy-evaluation and policy-improvement processes interact.

- Computing the Value Function with Dynamic Programming in Reinforcement Learning
- Monte-Carlo Control
- TD Control

### Steps

- Initialize policy $\pi$
- Repeat:
    - Policy evaluation: compute $Q_\pi$ for model-free, $V_\pi$ for model-based
    - Policy improvement: update $\pi$ greedily, $\pi'(s) = \arg\max_a Q_\pi(s, a)$ (see the sketch after this list)
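
A minimal sketch of this loop in the model-based case, assuming a tiny made-up 2-state MDP; the `P[s, a, s']` transition array and `R[s, a]` reward array are illustrative, not from the notes:

```python
import numpy as np

# Toy 2-state, 2-action MDP (made up for illustration).
n_states, gamma = 2, 0.9
P = np.array([[[1.0, 0.0], [0.0, 1.0]],   # P[s, a, s'] transition probs
              [[1.0, 0.0], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],                  # R[s, a] expected rewards
              [0.0, 2.0]])

pi = np.zeros(n_states, dtype=int)         # initialize policy (arbitrary)

for _ in range(50):                        # repeat until the policy is stable
    # Policy evaluation: solve V = R_pi + gamma * P_pi V exactly.
    P_pi = P[np.arange(n_states), pi]      # transition matrix under pi
    R_pi = R[np.arange(n_states), pi]      # reward vector under pi
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

    # Policy improvement: act greedily w.r.t. the one-step lookahead.
    Q = R + gamma * P @ V                  # Q[s, a] = R(s,a) + gamma * E[V(s')]
    new_pi = Q.argmax(axis=1)
    if np.array_equal(new_pi, pi):
        break                              # converged: pi is greedy w.r.t. V_pi
    pi = new_pi

print("greedy policy:", pi, "V:", V)
```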

### Key Terminologies

- Prediction / Learning → policy evaluation
- Control → policy improvement

### Some Ideas that I need to master

In David Silver's diagrams, the up arrows are policy evaluation, moving toward $v_\pi$, while the down arrows are greedy policy improvement, moving toward a new $\pi$.

With model-based problems, we use our full knowledge of the Markov Decision Process: dynamic programming via Policy Iteration or Value Iteration. For the policy improvement step, we use a policy that acts greedily with respect to $V(s)$.
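
Value Iteration folds the two processes into a single backup. A hedged sketch, reusing the assumed `P[s, a, s']` / `R[s, a]` array convention from the sketch above:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Sweep the Bellman optimality backup until V stops changing."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * P @ V               # one-step lookahead for all (s, a)
        V_new = Q.max(axis=1)               # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1), V_new  # policy acting greedily w.r.t. V
        V = V_new
```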

However, in the model-free setting, where we don't know the environment, we don't have that luxury: greedy policy improvement over $V(s)$ requires a model of the MDP, since $\pi'(s) = \arg\max_a \big( R(s,a) + \gamma \sum_{s'} P(s' \mid s, a) \, V(s') \big)$ needs the transition probabilities and rewards.

In model-free, we use $Q(s,a)$ instead. This reduces the burden: the action values cache the one-step lookahead, so greedy improvement is just $\pi'(s) = \arg\max_a Q(s, a)$, as the contrast sketch below shows.
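
A small sketch of that contrast, assuming tabular NumPy arrays as before; function names are illustrative:

```python
import numpy as np

def greedy_from_v(V, P, R, gamma=0.9):
    # Model-based improvement: cannot avoid the one-step lookahead
    # through the transition model P and reward function R.
    return (R + gamma * P @ V).argmax(axis=1)

def greedy_from_q(Q):
    # Model-free improvement: the cached action values already contain
    # the lookahead, so improvement is a pure table lookup.
    return Q.argmax(axis=1)
```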

David Silver teaches two main model-free methods for the evaluation step: Monte-Carlo (learn from complete sampled episode returns) and TD (bootstrap from current estimates).
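
A hedged sketch of the two update targets, assuming a tabular `Q` and a learning rate `alpha`; all names here are illustrative:

```python
def mc_update(Q, s, a, G, alpha=0.1):
    # Monte-Carlo: move Q(s, a) toward the full return G observed over a
    # complete episode; no bootstrapping, but must wait for the episode to end.
    Q[s][a] += alpha * (G - Q[s][a])

def td_update(Q, s, a, r, s_next, a_next, gamma=0.9, alpha=0.1):
    # TD (SARSA-style target): bootstrap from the current estimate at the
    # next state-action pair, updating after every single step.
    Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])
```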