Bellman Equation

The Bellman equation relates the value of the current state to the values of its successor states.

Bellman equations decompose the Value Function into the immediate reward plus the discounted value of successor states. They give us a concrete way to write down our different value functions and, later, to solve for them.

IMPORTANT

Make sure to understand the difference between the Bellman expectation backup and the Bellman optimality backup. See the section at the bottom.

Bellman Expectation Equation

Because it is an MDP, the value function can be decomposed into two parts: the immediate reward $R_{t+1}$, plus the discounted value of the successor state, $\gamma \, v_\pi(S_{t+1})$:

$$v_\pi(s) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma \, v_\pi(S_{t+1}) \mid S_t = s \right]$$
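This decomposition follows directly from the definition of the return, $G_t = R_{t+1} + \gamma G_{t+1}$; written out:

```latex
\begin{aligned}
v_\pi(s) &= \mathbb{E}_\pi\left[ G_t \mid S_t = s \right] \\
         &= \mathbb{E}_\pi\left[ R_{t+1} + \gamma G_{t+1} \mid S_t = s \right] \\
         &= \mathbb{E}_\pi\left[ R_{t+1} + \gamma \, v_\pi(S_{t+1}) \mid S_t = s \right]
\end{aligned}
```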

We usually use an iterative algorithm to solve for the Value Function, doing the following:

  • Initialize $v_0(s) = 0$ for all $s \in \mathcal{S}$
  • Start at $k = 1$ and continue until convergence (i.e. $v_{k+1} \approx v_k$)
  • At each step, sweep over all states and apply the backup $v_{k+1}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \, v_k(s') \right)$

The above is called a Bellman expectation backup for a particular policy $\pi$. It is essentially the same equation as the iterative algorithm for solving a Markov Reward Process; we just plug in the policy, since an MDP with a fixed policy is an MRP.

We use this in Policy Evaluation.
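A minimal sketch of this backup in Python, assuming a small tabular MDP stored as arrays `P`, `R`, and a stochastic policy `pi` (these names and shapes are illustrative, not from any particular library):

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """Iterative Bellman expectation backup for a small tabular MDP.

    P[a, s, s'] : probability of reaching s' from s under action a
    R[s, a]     : expected immediate reward for taking action a in state s
    pi[s, a]    : probability of taking action a in state s
    """
    v = np.zeros(R.shape[0])                        # initialize v_0(s) = 0 for all s
    while True:
        # q[s, a] = R[s, a] + gamma * sum_{s'} P[a, s, s'] * v[s']
        q = R + gamma * np.einsum("asx,x->sa", P, v)
        v_new = (pi * q).sum(axis=1)                # average over actions under pi
        if np.max(np.abs(v_new - v)) < tol:         # stop once v_{k+1} ~= v_k
            return v_new
        v = v_new
```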

The Bellman expectation equation is a linear equation, so it can also be solved directly. Writing it in matrix form, $v_\pi = \mathcal{R}^\pi + \gamma \mathcal{P}^\pi v_\pi$, we get $v_\pi = (I - \gamma \mathcal{P}^\pi)^{-1} \mathcal{R}^\pi$, where $\mathcal{R}^\pi$ is the expected reward vector and $\mathcal{P}^\pi$ is the state transition matrix under policy $\pi$.
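For a small state space this direct solve is easy to write down; a sketch, reusing the illustrative `P`, `R`, `pi` layout from above:

```python
import numpy as np

def policy_evaluation_direct(P, R, pi, gamma=0.9):
    """Exact solve of v_pi = R_pi + gamma * P_pi v_pi, i.e. v_pi = (I - gamma P_pi)^{-1} R_pi."""
    # Collapse the MDP under pi into the induced MRP.
    P_pi = np.einsum("sa,asx->sx", pi, P)   # P_pi[s, s'] = sum_a pi(a|s) * P[a, s, s']
    R_pi = (pi * R).sum(axis=1)             # R_pi[s]     = sum_a pi(a|s) * R[s, a]
    n = R_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)
```

The direct solve is cubic in the number of states, which is why the iterative backup above is preferred for anything but small problems.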

Danger

After some reflection, I am realizing why I see some inconsistencies.

Sometimes I see this format, where the reward term sits outside any sum over actions:

$$v_\pi(s) = \mathcal{R}_s^{\pi(s)} + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^{\pi(s)} \, v_\pi(s')$$

but in other equations I see this form:

$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \, v_\pi(s') \right)$$

Actually, they are both the same. The first is written for a deterministic policy $a = \pi(s)$, while the second is for a stochastic policy $\pi(a \mid s)$.

Also, a capitalized letter (e.g. $R$) represents a function or random variable, while a lowercase letter (e.g. $r$) is a particular value. These are all minor notation differences.
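One way to see that the two forms agree: write the deterministic policy as a one-hot distribution, and the sum over actions collapses to the single chosen action. A toy check with made-up numbers:

```python
import numpy as np

gamma  = 0.9
R_sa   = np.array([1.0, 3.0])   # made-up R(s, a) for the two actions in some state s
next_v = np.array([0.5, 2.0])   # made-up expected successor value sum_{s'} P(s'|s,a) v(s'), per action

# Deterministic form: the reward of the single chosen action a = pi(s) appears on its own.
a = 1
v_det = R_sa[a] + gamma * next_v[a]

# Stochastic form: a one-hot policy pi(a|s) = [0, 1] puts all weight on that same action.
pi = np.array([0.0, 1.0])
v_stoch = np.sum(pi * (R_sa + gamma * next_v))

assert np.isclose(v_det, v_stoch)   # both forms give the same value
```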

Principle of Optimality

Any Optimal Policy can be subdivided into two components:

  1. An optimal first action
  2. Followed by an optimal policy from successor state

Theorem (Principle of Optimality)

A policy $\pi$ achieves the optimal value from state $s$, $v_\pi(s) = v_*(s)$, if and only if

  • For any state $s'$ reachable from $s$,
  • $\pi$ achieves the optimal value from state $s'$: $v_\pi(s') = v_*(s')$

Bellman Optimality Equation

The Bellman Optimality Equation defines how the optimal value of a state is related to the optimal value of its successor states. It has a “max” instead of an average.

The optimal value functions are recursively related by the Bellman optimality equations:

For the state-value function:

$$v_*(s) = \max_{a} \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \, v_*(s') \right)$$

Other notation: $v_*(s) = \max_a q_*(s, a)$, and for the action-value function, $q_*(s, a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \max_{a'} q_*(s', a')$.

#gap-in-knowledge Review the David Silver lectures; I don't understand the big picture: why does a 1-step lookahead give you the optimal policy? See page 86 of the RL book.

A greedy policy is actually optimal in the long-term sense in which we are interested, because $v_*$ already takes into account the reward consequences of all possible future behavior.

By means of $v_*$, the optimal expected long-term return is turned into a quantity that is locally and immediately available for each state. Hence, a one-step-ahead search yields the long-term optimal actions.
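To make the one-step lookahead concrete, here is a small sketch of extracting a greedy policy from $v_*$, using the same illustrative `P`, `R` layout as before:

```python
import numpy as np

def greedy_policy(P, R, v_star, gamma=0.9):
    """One-step lookahead: in each state, pick the action maximizing
    R[s, a] + gamma * sum_{s'} P[a, s, s'] * v_star[s'].
    Because v_star already summarizes all future rewards, this greedy
    choice is optimal in the long-term sense."""
    q_star = R + gamma * np.einsum("asx,x->sa", P, v_star)
    return np.argmax(q_star, axis=1)    # deterministic policy pi_*(s) = argmax_a q_*(s, a)
```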

Solving the Bellman Optimality Equation

The Bellman Optimality Equation is non-linear (because of the max), so in general there is no closed-form solution. Instead, we solve it with iterative methods such as Value Iteration, Policy Iteration, Q-learning, or SARSA.
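A minimal Value Iteration sketch (same illustrative array layout as above); it is identical to the policy evaluation loop except that the average under $\pi$ is replaced by a max over actions:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Iterate the Bellman optimality backup until convergence to v_*."""
    v = np.zeros(R.shape[0])
    while True:
        # Bellman optimality backup: max over actions instead of an average under pi.
        q = R + gamma * np.einsum("asx,x->sa", P, v)
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```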

Difference between Bellman Expectation and Bellman Optimality

In summary, we have the Bellman expectation backup (used in Policy Evaluation):

$$v_{k+1}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \, v_k(s') \right)$$

Otherwise, we have the Bellman optimality backup (used in Value Iteration), where we don't have a policy and simply take the max over actions:

$$v_{k+1}(s) = \max_{a \in \mathcal{A}} \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \, v_k(s') \right)$$

The difference is that we use the Bellman expectation backup to get the expected values $v_\pi$ of a particular policy $\pi$. On the other hand, we use the Bellman optimality backup to get the optimal value function $v_*$ (the values we would obtain if we could somehow find an optimal policy $\pi_*$).