# Policy Iteration

Intuition: start with a policy, compute its value function, use those values to build a better policy, and rinse and repeat until we reach the optimal policy.

Policy iteration is the process of alternating between two steps:

- Policy Evaluation: with the current policy $π$ fixed, find its values with simplified Bellman expectation backups (iterate until the values converge):

  $v_{i+1}(s)=∑_{s′,r}p(s′,r∣s,π(s))[r+γv_{i}(s′)]$

  Other notation from CS287:

  $V_{i+1}(s)=∑_{s′}T(s,π_{k}(s),s′)[R(s,π_{k}(s),s′)+γV_{i}(s′)]$

- Notice that the above is extremely similar to Value Iteration: instead of maximizing over $T(s,a,s′)$, we use $T(s,π(s),s′)$ to evaluate a particular policy.
- We are not taking the max because we don’t have a choice: the policy $π$ decides the action, $a=π(s)$.
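The evaluation step above can be sketched in code. This assumes a hypothetical transition-table format `P[s][a] = [(prob, next_state, reward), ...]` (not from the source, just a convenient encoding of $p(s′,r∣s,a)$):

```python
# Iterative policy evaluation for a FIXED policy: repeatedly apply the
# Bellman expectation backup until the values stop changing.
# P[s][a] is a list of (probability, next_state, reward) tuples (assumed format).
def policy_evaluation(P, policy, gamma=0.9, theta=1e-8):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            a = policy[s]  # no max over actions: the policy decides a = pi(s)
            v_new = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:  # values have converged
            return V
```

For example, on a two-state MDP where state 0 transitions to an absorbing state 1 with reward 1, this converges to $v(0)=1$, $v(1)=0$.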

- Policy Improvement: with the utilities fixed, find the best action according to a one-step lookahead. You basically override only the very first action.

- Improve the policy by acting greedily with respect to $v_{π}$ (Policy Improvement): $π′=greedy(v_{π})$

  $π′(s)=argmax_{a∈A}\ q_{π}(s,a)$

  $π_{k+1}(s)=argmax_{a}∑_{s′,r}p(s′,r∣s,a)[r+γv_{π_{k}}(s′)]$

  Alternative notation from CS287:

  $π_{k+1}(s)=argmax_{a}∑_{s′}T(s,a,s′)[R(s,a,s′)+γV_{π_{k}}(s′)]$

Repeat steps 1 and 2 until the policy converges, i.e. $π_{k+1}(s)=π_{k}(s)$ for all $s$.
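Putting the two steps together gives the full loop. This sketch uses the same assumed transition-table format `P[s][a] = [(prob, next_state, reward), ...]`; the helper `q_value` is a hypothetical name for the one-step lookahead:

```python
# One-step lookahead: q_pi(s, a) given the current value estimates V.
# P[s][a] is a list of (probability, next_state, reward) tuples (assumed format).
def q_value(P, V, s, a, gamma):
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

def policy_iteration(P, gamma=0.9, theta=1e-8):
    policy = {s: next(iter(P[s])) for s in P}  # arbitrary initial policy
    while True:
        # Step 1: policy evaluation (Bellman expectation backups to convergence)
        V = {s: 0.0 for s in P}
        while True:
            delta = 0.0
            for s in P:
                v_new = q_value(P, V, s, policy[s], gamma)
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Step 2: policy improvement (act greedily w.r.t. v_pi)
        stable = True
        for s in P:
            best_a = max(P[s], key=lambda a: q_value(P, V, s, a, gamma))
            if best_a != policy[s]:
                policy[s] = best_a
                stable = False
        if stable:  # pi_{k+1} == pi_k: the optimal policy has been found
            return policy, V
```

On a tiny MDP where state 0 can reach an absorbing state with reward 0 or reward 1, the loop settles on the reward-1 action after one improvement step.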

Policy iteration always converges to the optimal policy $π_{*}$. Compared with Value Iteration, each iteration does more work, because we first have to calculate the value function for a particular policy and only then come up with a new policy.

### Proof

We essentially act greedily with respect to the action-value function, doing a Bellman Optimality Backup.

This improves the value from any state $s$ over one step, since the greedy action is at least as good as the policy's action:

$q_{π}(s,π′(s))=max_{a}\ q_{π}(s,a)≥q_{π}(s,π(s))=v_{π}(s)$

It therefore improves the value function:

$v_{π′}(s)≥v_{π}(s)$
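The jump from the one-step inequality to $v_{π′}(s)≥v_{π}(s)$ hides a telescoping argument; a sketch (following the standard policy improvement proof, with expectations taken under $π′$):

```latex
v_\pi(s) \le q_\pi(s, \pi'(s))
         = \mathbb{E}_{\pi'}\big[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s \big] \\
\le \mathbb{E}_{\pi'}\big[ R_{t+1} + \gamma q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s \big] \\
\le \cdots \le \mathbb{E}_{\pi'}\big[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s \big]
         = v_{\pi'}(s)
```

Each step unrolls one more action under $π′$ and applies the one-step inequality again, so the bound compounds over the whole trajectory.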