Reinforcement Learning, Bellman Update

Policy Iteration

Intuition: build a policy, then build its value function, then use that to build a better policy, and rinse and repeat until we reach the optimal policy.

Policy iteration is the process of alternating between two steps:

  1. Policy Evaluation: with the current policy $\pi$ held fixed, find its values with simplified Bellman expectation backups, $V^{\pi}_{k+1}(s) \leftarrow \sum_{s'} T(s, \pi(s), s')\,[R(s, \pi(s), s') + \gamma V^{\pi}_{k}(s')]$ (iterate until the values converge)

Other Notation from CS287

  • Notice that the above is extremely similar to the Value Iteration backup: instead of taking the $\max_a$ over actions as in Value Iteration, we plug in the action $\pi(s)$ chosen by the policy in order to evaluate that particular policy.
  • We are not taking the max because we don’t have a choice; the policy decides the action.
  2. Policy Improvement: with the utilities held fixed, find the best action according to a one-step lookahead. You basically override only the very first action (a sketch of both steps follows this list).
  • Improve the policy by acting greedily with respect to $v_\pi$, i.e. $\pi'(s) = \arg\max_{a} q_\pi(s, a)$ (Policy Improvement)
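
Putting the two steps together, here is a minimal tabular sketch (not from the notes; it assumes a CS188-style transition model where `P[s][a]` is a list of `(prob, next_state, reward)` tuples, and all names are illustrative):

```python
import numpy as np

def policy_evaluation(policy, P, gamma=0.9, tol=1e-8):
    """Simplified Bellman expectation backups under a fixed policy."""
    V = np.zeros(len(P))
    while True:
        delta = 0.0
        for s in range(len(P)):
            v_new = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

def policy_improvement(V, P, gamma=0.9):
    """Greedy one-step lookahead with respect to the current values."""
    new_policy = np.zeros(len(P), dtype=int)
    for s in range(len(P)):
        q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
             for a in range(len(P[s]))]
        new_policy[s] = int(np.argmax(q))
    return new_policy

def policy_iteration(P, gamma=0.9):
    policy = np.zeros(len(P), dtype=int)
    while True:
        V = policy_evaluation(policy, P, gamma)
        new_policy = policy_improvement(V, P, gamma)
        if np.array_equal(new_policy, policy):  # policy converged: pi_{k+1} == pi_k
            return policy, V
        policy = new_policy
```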

How does acting epsilon greedy actually result in policy improvement??

Acting ε-greedy with respect to Q(s, a) leads to policy improvement because the greedy part exploits the best-known action, and the ε-randomness ensures exploration that lets us discover even better actions. Over time, this exploration refines Q(s, a), and thus, the greedy part improves too.

umm, but like wouldn’t that refine the Q negatively too?? Exploration can lead to bad actions. See proof below
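
Just to make the mechanism concrete, a minimal sketch of ε-greedy action selection over a Q-table (the dict-style `q_table` keyed by `(state, action)` is an illustrative assumption):

```python
import random

def epsilon_greedy_action(q_table, state, actions, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the greedy action."""
    if random.random() < epsilon:
        return random.choice(actions)                        # explore
    return max(actions, key=lambda a: q_table[(state, a)])   # exploit
```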

Alternative notation from CS287: Repeat steps 1 and 2 until the policy converges, i.e. $\pi_{k+1}(s) = \pi_k(s)$ for all states $s$.

Policy iteration always converges to the optimal policy $\pi^*$. Compared with Value Iteration, this seems a lot more expensive per iteration, because we have to calculate the value function for a particular policy and only then come up with a new policy.

Note

The example below is done with action-value functions, whereas the textbook uses state-value functions. It’s essentially the same thing, but in Model-Free Control we always use q-values, and model-free settings are where the more interesting problems are.
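
As a quick reminder (standard definitions, not specific to either course's notation), the two value functions are related by:

$$
q_\pi(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, v_\pi(s'),
\qquad
v_\pi(s) = \sum_{a} \pi(a \mid s)\, q_\pi(s, a).
$$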

Proof (IMPORTANT)

We essentially act greedily with respect to the action-value function, $\pi'(s) = \arg\max_{a} q_\pi(s, a)$, which is a Bellman Optimality Backup.

https://www.tuananhle.co.uk/notes/policy-improvement-theorem.html

This improves the value from any state $s$ over one step, since $q_\pi(s, \pi'(s)) = \max_a q_\pi(s, a) \ge q_\pi(s, \pi(s)) = v_\pi(s)$. It therefore improves the value function, $v_{\pi'}(s) \ge v_\pi(s)$.
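
A sketch of the full telescoping argument from the link above (standard policy improvement theorem):

$$
\begin{aligned}
v_\pi(s) &\le q_\pi(s, \pi'(s)) = \mathbb{E}_{\pi'}\!\left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s \right] \\
&\le \mathbb{E}_{\pi'}\!\left[ R_{t+1} + \gamma q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s \right] \\
&\le \mathbb{E}_{\pi'}\!\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 q_\pi(S_{t+2}, \pi'(S_{t+2})) \mid S_t = s \right] \\
&\;\;\vdots \\
&\le \mathbb{E}_{\pi'}\!\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s \right] = v_{\pi'}(s).
\end{aligned}
$$

If the improvement ever stops, i.e. $\max_a q_\pi(s, a) = q_\pi(s, \pi(s)) = v_\pi(s)$, then the Bellman optimality equation is satisfied and $\pi$ is already the optimal policy.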