Generalized Policy Iteration

This is KEY to understanding how reinforcement learning works.

Generalized Policy Iteration refers to the general idea of letting policy-evaluation and policy-improvement processes interact.

Steps

  • Initialize policy
  • Repeat
    • Policy evaluation: compute Q(s, a) in the model-free case, V(s) in the model-based case
    • Policy improvement: update the policy to act greedily with respect to the current value function (a minimal sketch of this loop follows below)

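A minimal sketch of this loop as classic policy iteration on a tiny, made-up MDP (the tables P, R, and γ = 0.9 are my own illustrative numbers, not from the lecture):

```python
import numpy as np

# Toy 3-state, 2-action MDP -- all numbers are made up purely for illustration.
# P[s, a, s'] is the transition probability, R[s, a] the expected reward.
gamma = 0.9
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.0, 0.9]],
    [[0.0, 0.9, 0.1], [0.0, 0.1, 0.9]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
])
R = np.array([[0.0, 1.0],
              [0.0, 2.0],
              [0.0, 0.0]])
n_states, n_actions = R.shape

policy = np.zeros(n_states, dtype=int)  # initialize: always pick action 0

while True:
    # Policy evaluation: iterate the Bellman expectation backup until V converges.
    V = np.zeros(n_states)
    while True:
        V_new = np.array([R[s, policy[s]] + gamma * P[s, policy[s]] @ V
                          for s in range(n_states)])
        if np.max(np.abs(V_new - V)) < 1e-8:
            V = V_new
            break
        V = V_new

    # Policy improvement: one-step lookahead through the model, then act greedily.
    Q = R + gamma * np.einsum('sat,t->sa', P, V)
    new_policy = Q.argmax(axis=1)

    if np.array_equal(new_policy, policy):  # policy stable => done
        break
    policy = new_policy

print("greedy policy:", policy, "state values:", np.round(V, 3))
```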
Key Terminologies

  • Prediction / Learning → policy evaluation
  • Control → policy improvement

Some Ideas that I need to master

In David Silver’s diagrams, the up arrows are policy evaluation (finding v_π for the current policy), while the down arrows are policy improvement (finding a greedy policy π with respect to that value function).

With model-based problems, we use our full knowledge of the Markov Decision Process and apply dynamic programming: either policy iteration (iterative policy evaluation plus greedy improvement) or value iteration. For the policy improvement step, we use a policy that acts greedily with respect to V(s).

However, in the model-free setting, where we don’t know the environment’s dynamics, we don’t have that luxury: greedy policy improvement over V(s) requires a model of the MDP, because the one-step lookahead needs the transition probabilities and rewards.
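Written out with the model explicit (using the $\mathcal{P}^a_{ss'}$, $\mathcal{R}^a_s$ notation from Silver’s slides), the greedy step over V(s) is:

$$\pi'(s) = \arg\max_{a \in \mathcal{A}} \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'}\, V(s') \right)$$

Both $\mathcal{R}^a_s$ and $\mathcal{P}^a_{ss'}$ come from the model, which is exactly what we don’t have in the model-free case.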

In model-free settings, we evaluate the action-value function Q(s, a) instead. Acting greedily over Q(s, a) needs no model: π'(s) = argmax_a Q(s, a). This removes the burden of knowing the dynamics, since the value of each action is effectively cached in the Q table.
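A minimal sketch of improvement straight from Q (the ε-greedy wrapper and all numbers are my own illustration; ε-greedy is the form of greedy improvement Silver uses for model-free control, to keep exploring while improving):

```python
import numpy as np

def epsilon_greedy_action(Q, state, epsilon=0.1, rng=None):
    """Policy improvement straight from Q(s, a): no transition model needed.
    With probability epsilon explore uniformly, otherwise act greedily."""
    rng = rng if rng is not None else np.random.default_rng()
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore
    return int(Q[state].argmax())             # exploit: argmax_a Q(s, a)

# Made-up Q table for a 3-state, 2-action problem.
Q = np.array([[0.1, 0.5],
              [0.7, 0.2],
              [0.0, 0.0]])
print(epsilon_greedy_action(Q, state=0))
```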

David Silver teaches two main model-free methods for the evaluation step: Monte-Carlo (learn from complete episodes, updating toward the full return) and Temporal-Difference (bootstrap, updating toward a target built from the current value estimate after one step).
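To make the distinction concrete, a sketch of the two tabular update rules (α, the episode format, and the example numbers are my own assumptions): Monte-Carlo waits for the complete episode and uses the actual return, while TD(0) bootstraps from the current estimate of the next state’s value.

```python
def mc_update(V, episode, alpha=0.1, gamma=0.99):
    """Monte-Carlo: after a complete episode, move each visited state's value
    toward the actual full return G_t observed from that state onward."""
    G = 0.0
    for state, reward in reversed(episode):   # episode = [(s_0, r_1), (s_1, r_2), ...]
        G = reward + gamma * G
        V[state] += alpha * (G - V[state])
    return V

def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.99):
    """TD(0): bootstrap -- move V(s) toward r + gamma * V(s'), using the current
    estimate of the next state instead of waiting for the episode to end."""
    V[state] += alpha * (reward + gamma * V[next_state] - V[state])
    return V

# Tiny usage example with made-up states and rewards.
V = {0: 0.0, 1: 0.0, 2: 0.0}
V = mc_update(V, [(0, 1.0), (1, 0.0), (2, 5.0)])
V = td0_update(V, state=0, reward=1.0, next_state=1)
```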