# Policy Iteration

Intuition: start with a policy, compute its value function, use those values to build a better policy, and rinse and repeat until we reach the optimal policy.

Policy iteration is the process of alternating between two steps:

- Policy Evaluation: with the current policy $π$ fixed, find its values with simplified Bellman expectation backups (iterate until the values converge):

  $v_{i+1}(s)=∑_{s′,r}p(s′,r∣s,π(s))[r+γv_{i}(s′)]$

  Other notation from CS287:

  $V_{i+1}(s)=∑_{s′}T(s,π_{k}(s),s′)[R(s,π_{k}(s),s′)+γV_{i}(s′)]$

- Notice that the above is extremely similar to Value Iteration: instead of maximizing over $T(s,a,s′)$, we use $T(s,π(s),s′)$ to evaluate a particular policy.
- We are not taking the max because we don’t have a choice: the policy $π$ decides the action, $a=π(s)$.
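The evaluation step above can be sketched in code. This assumes a hypothetical transition-table format `P[s][a] = [(prob, next_state, reward), ...]` (not from the source, just a convenient encoding of $p(s′,r∣s,a)$):

```python
# Iterative policy evaluation for a FIXED policy: repeatedly apply the
# Bellman expectation backup until the values stop changing.
# P[s][a] is a list of (probability, next_state, reward) tuples (assumed format).
def policy_evaluation(P, policy, gamma=0.9, theta=1e-8):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            a = policy[s]  # no max over actions: the policy decides a = pi(s)
            v_new = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:  # values have converged
            return V
```

For example, on a two-state MDP where state 0 transitions to an absorbing state 1 with reward 1, this converges to $v(0)=1$, $v(1)=0$.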

- Policy Improvement: with the utilities fixed, find the best action according to a one-step lookahead. You basically override only the very first action.

- Improve the policy by acting greedily with respect to $v_{π}$ (Policy Improvement): $π′=greedy(v_{π})$

  $π′(s)=argmax_{a∈A}\ q_{π}(s,a)$

  $π_{k+1}(s)=argmax_{a}∑_{s′,r}p(s′,r∣s,a)[r+γv_{π_{k}}(s′)]$

  Alternative notation from CS287:

  $π_{k+1}(s)=argmax_{a}∑_{s′}T(s,a,s′)[R(s,a,s′)+γV_{π_{k}}(s′)]$

Repeat steps 1 and 2 until the policy converges, i.e. $π_{k+1}(s)=π_{k}(s)$ for all $s$.
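Putting the two steps together gives the full loop. This sketch uses the same assumed transition-table format `P[s][a] = [(prob, next_state, reward), ...]`; the helper `q_value` is a hypothetical name for the one-step lookahead:

```python
# One-step lookahead: q_pi(s, a) given the current value estimates V.
# P[s][a] is a list of (probability, next_state, reward) tuples (assumed format).
def q_value(P, V, s, a, gamma):
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

def policy_iteration(P, gamma=0.9, theta=1e-8):
    policy = {s: next(iter(P[s])) for s in P}  # arbitrary initial policy
    while True:
        # Step 1: policy evaluation (Bellman expectation backups to convergence)
        V = {s: 0.0 for s in P}
        while True:
            delta = 0.0
            for s in P:
                v_new = q_value(P, V, s, policy[s], gamma)
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Step 2: policy improvement (act greedily w.r.t. v_pi)
        stable = True
        for s in P:
            best_a = max(P[s], key=lambda a: q_value(P, V, s, a, gamma))
            if best_a != policy[s]:
                policy[s] = best_a
                stable = False
        if stable:  # pi_{k+1} == pi_k: the optimal policy has been found
            return policy, V
```

On a tiny MDP where state 0 can reach an absorbing state with reward 0 or reward 1, the loop settles on the reward-1 action after one improvement step.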

Policy iteration always converges to the optimal policy $π_{*}$. Compared with Value Iteration, each iteration does more work, because we first have to calculate the value function for a particular policy and only then come up with a new policy.

### Proof

We essentially act greedily with respect to the action-value function, doing a Bellman Optimality Backup.

This improves the value from any state $s$ over one step, since the greedy action is at least as good as the policy's action:

$q_{π}(s,π′(s))=max_{a}\ q_{π}(s,a)≥q_{π}(s,π(s))=v_{π}(s)$

It therefore improves the value function:

$v_{π′}(s)≥v_{π}(s)$
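The jump from the one-step inequality to $v_{π′}(s)≥v_{π}(s)$ hides a telescoping argument; a sketch (following the standard policy improvement proof, with expectations taken under $π′$):

```latex
v_\pi(s) \le q_\pi(s, \pi'(s))
         = \mathbb{E}_{\pi'}\big[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s \big] \\
\le \mathbb{E}_{\pi'}\big[ R_{t+1} + \gamma q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s \big] \\
\le \cdots \le \mathbb{E}_{\pi'}\big[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s \big]
         = v_{\pi'}(s)
```

Each step unrolls one more action under $π′$ and applies the one-step inequality again, so the bound compounds over the whole trajectory.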