# Bellman Equation

The Bellman equation relates the value of the current state to the value of its successor states.

It decomposes the [[notes/Value Function|Value Function]] into the immediate reward plus discounted expected future rewards. This gives a concrete, recursive representation of our value functions that we can later solve.

> [!important]
>
> Make sure to understand the difference between the Bellman expectation backup and the Bellman optimality backup. See the comparison at the bottom.

### Bellman Expectation Equation

Because it is an MDP, the value function can be decomposed into two parts: the immediate reward $R_{t+1}$ plus the discounted value of the successor state $\gamma v_\pi(S_{t+1})$:

$$\begin{aligned} v_\pi(s) &= \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]\\ &= \sum_a \pi(a|s) \sum_{s',r}p(s',r|s,a) [r + \gamma v_\pi(s')] \end{aligned}$$

We usually use an iterative algorithm to solve the [[notes/Value Function|Value Function]], doing the following:

- Initialize $V_{0}(s) = 0$ for all $s$
- Start at $k=1$ and continue until convergence (i.e. $|V_k - V_{k-1}| < \epsilon$)
- $\forall s \in S$, $$V_{k}^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s' \in S}{P(s'|s, \pi(s))V_{k-1}^\pi(s')}$$

The above is called a **Bellman expectation backup** for a particular policy. It is essentially the same equation as the iterative algorithm for solving a [[notes/Markov Decision Process|Markov Reward Process]]; we just plug in the policy, since MDP = MRP + policy. We use this in [[notes/Policy Evaluation|Policy Evaluation]].
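As a concrete illustration, here is a minimal sketch of that iterative evaluation loop. It is not from the lectures; the tabular representation `P[s][a] = [(prob, next_state, reward), ...]` and the deterministic policy `pi[s]` are assumptions made for the example.

```python
# Iterative policy evaluation via repeated Bellman expectation backups.
# Hypothetical representation: P[s][a] is a list of (prob, next_state, reward)
# tuples and pi[s] is the action a deterministic policy takes in state s.

def evaluate_policy(P, pi, gamma=0.9, eps=1e-6):
    n_states = len(P)
    V = [0.0] * n_states                      # V_0(s) = 0 for all s
    while True:
        delta = 0.0
        V_new = [0.0] * n_states
        for s in range(n_states):
            a = pi[s]
            # Bellman expectation backup:
            # V_k(s) = sum_{s',r} P(s',r|s,a) * (r + gamma * V_{k-1}(s'))
            V_new[s] = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            delta = max(delta, abs(V_new[s] - V[s]))
        V = V_new
        if delta < eps:                       # |V_k - V_{k-1}| < epsilon
            return V
```

Each pass over the states is one expectation backup of $V_{k-1}$ into $V_k$; the loop stops once successive sweeps differ by less than $\epsilon$.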
The Bellman expectation equation is linear, so it can also be solved directly in matrix form, where

$$v = R + \gamma P v$$

![[attachments/Screen Shot 2021-12-08 at 3.45.40 PM.png]]

$$v = (I-\gamma P)^{-1} R$$

> [!danger] Danger
>
> After some reflecting, I am realizing why I see some inconsistencies.
>
> Sometimes, I see this format
> $$V = R + \gamma P V$$
> $$V_{k}^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s' \in S}{P(s'|s, \pi(s))V_{k-1}^\pi(s')}$$
> so the reward $R$ seems to be separate, but in other equations I see this form
> $$v_{k+1}(s) = \sum_{a}\pi(a|s) \sum_{s',r}p(s',r|s,a)(r + \gamma v_k(s'))$$
>
> They are actually the same. The first is written for a deterministic policy, while the second is for a stochastic policy. The capitalized $R$ represents a [[notes/Function|Function]] (the expected reward), while a lowercase $r$ is a single reward value. These are all minor notation differences.

### [[notes/Principle of Optimality|Principle of Optimality]]

Any [[notes/Optimal Policy|Optimal Policy]] can be subdivided into two components:

1. An optimal first action $A_*$
2. Followed by an optimal policy from the successor state $S'$

> [!note] Theorem (Principle of Optimality)
>
> A policy $\pi(a|s)$ achieves the optimal value from state $s$, $v_\pi(s) = v_*(s)$, if and only if:
> - for every state $s'$ reachable from $s$, $\pi$ achieves the optimal value from state $s'$, i.e. $v_\pi(s') = v_*(s')$

### Bellman Optimality Equation

The Bellman Optimality Equation defines how the optimal value of a state is related to the optimal value of successor states. It has a "max" instead of an average.

The [[notes/Value Function#optimal-value-function|optimal value functions]] are recursively related by the Bellman optimality equations:

$$v_*(s) = \max_a \sum_{s',r} p(s',r |s,a)[r + \gamma v_*(s')]$$

For the action-value function:

$$q_*(s,a) = \sum_{s',r} p(s',r |s,a)[r + \gamma \max_{a'}q_*(s', a')]$$

Other notation:

$$Q_*(s,a) = R(s,a) + \gamma \sum\limits_{s' \in S} P_{ss'}^a \max_{a' \in A} Q_*(s', a')$$

#gap-in-knowledge Review the David Silver lectures; I don't understand the big picture: why does a one-step lookahead give you the optimal policy?

From page 86 of the RL book:

> A greedy policy is actually optimal in the long-term sense in which we are interested because $v_*$ already takes into account the reward consequences of all possible future behavior. By means of $v_*$, the optimal expected long-term return is turned into a quantity that is locally and immediately available for each state. Hence, a one-step-ahead search yields the long-term optimal actions.

#### Solving the Bellman Optimality Equation

The Bellman Optimality Equation is non-linear:

- There is no closed-form solution
- Instead, there are many iterative solution methods
    - [[notes/Value Iteration|Value Iteration]]
    - [[notes/Policy Iteration|Policy Iteration]]
    - [[notes/Q-Learning|Q-Learning]]
    - [[notes/Sarsa|Sarsa]]

### Difference between Bellman Expectation and Bellman Optimality

In summary, we have the **Bellman expectation backup** (used in [[notes/Policy Evaluation|Policy Evaluation]])

$$v^\pi_{k+1}(s) = \sum_{a}\pi(a|s) \sum_{s',r}p(s',r|s,a)(r + \gamma v_k^\pi(s'))$$

and the **Bellman optimality backup** (used in [[notes/Value Iteration|Value Iteration]]), where we don't follow a policy and instead simply take the maximizing action

$$v_{k+1}(s) = \max_{a} \sum_{s',r}p(s',r|s,a)[r + \gamma v_k(s')]$$

The difference is that we use the Bellman expectation backup to get the expected values of a particular policy $\pi$, whereas we use the Bellman optimality backup to get the optimal value function (these are the values we would obtain if we could somehow find an optimal policy $\pi$).
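To make the contrast concrete, here is a small sketch of the two backups side by side, under the same assumed tabular representation as the sketch above; here `pi[s][a]` is a stochastic policy $\pi(a|s)$.

```python
# The two backups, differing only in how they combine actions.
# Hypothetical representation: P[s][a] = [(prob, next_state, reward), ...],
# pi[s][a] = probability of taking action a in state s.

def expectation_backup(P, pi, V, s, gamma=0.9):
    """Bellman expectation backup: policy-weighted average over actions."""
    return sum(
        pi[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
        for a in range(len(P[s]))
    )

def optimality_backup(P, V, s, gamma=0.9):
    """Bellman optimality backup: max over actions instead of an average."""
    return max(
        sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
        for a in range(len(P[s]))
    )
```

Sweeping `expectation_backup` over all states is one iteration of [[notes/Policy Evaluation|Policy Evaluation]]; sweeping `optimality_backup` is one iteration of [[notes/Value Iteration|Value Iteration]]. The only structural change is replacing the sum weighted by $\pi(a|s)$ with a max, which is exactly the difference between the two equations above.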