# Bellman Equation

The Bellman equation relates the value of the current state to the value of its successor states.

It decomposes the [[notes/Value Function|Value Function]] into the immediate reward plus discounted expected future rewards. This gives a concrete, recursive representation of our value functions that we can later solve.

> [!important]
>
> Make sure to understand the difference between the Bellman expectation backup and the Bellman optimality backup. See the comparison at the bottom.

### Bellman Expectation Equation

Because it is an MDP, the value function can be decomposed into two parts: the immediate reward $R_{t+1}$ plus the discounted value of the successor state $\gamma v_\pi(S_{t+1})$:

$$\begin{aligned} v_\pi(s) &= \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]\\ &= \sum_a \pi(a|s) \sum_{s',r}p(s',r|s,a) [r + \gamma v_\pi(s')] \end{aligned}$$

We usually use an iterative algorithm to solve the [[notes/Value Function|Value Function]], doing the following:

- Initialize $V_{0}(s) = 0$ for all $s$
- Start at $k=1$ and continue until convergence (i.e. $|V_k - V_{k-1}| < \epsilon$)
- $\forall s \in S$, $$V_{k}^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s' \in S}{P(s'|s, \pi(s))V_{k-1}^\pi(s')}$$

The above is called a **Bellman expectation backup** for a particular policy. It is essentially the same equation as the iterative algorithm for solving a [[notes/Markov Decision Process|Markov Reward Process]]; we just plug in the policy, since MDP = MRP + policy. We use this in [[notes/Policy Evaluation|Policy Evaluation]].
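As a concrete illustration, here is a minimal sketch of that iterative evaluation loop. It is not from the lectures; the tabular representation `P[s][a] = [(prob, next_state, reward), ...]` and the deterministic policy `pi[s]` are assumptions made for the example.

```python
# Iterative policy evaluation via repeated Bellman expectation backups.
# Hypothetical representation: P[s][a] is a list of (prob, next_state, reward)
# tuples and pi[s] is the action a deterministic policy takes in state s.

def evaluate_policy(P, pi, gamma=0.9, eps=1e-6):
    n_states = len(P)
    V = [0.0] * n_states                      # V_0(s) = 0 for all s
    while True:
        delta = 0.0
        V_new = [0.0] * n_states
        for s in range(n_states):
            a = pi[s]
            # Bellman expectation backup:
            # V_k(s) = sum_{s',r} P(s',r|s,a) * (r + gamma * V_{k-1}(s'))
            V_new[s] = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            delta = max(delta, abs(V_new[s] - V[s]))
        V = V_new
        if delta < eps:                       # |V_k - V_{k-1}| < epsilon
            return V
```

Each pass over the states is one expectation backup of $V_{k-1}$ into $V_k$; the loop stops once successive sweeps differ by less than $\epsilon$.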
The Bellman expectation equation is linear, so it can also be solved directly in matrix form, where

$$v = R + \gamma P v$$

![[attachments/Screen Shot 2021-12-08 at 3.45.40 PM.png]]

$$v = (I-\gamma P)^{-1} R$$

> [!danger] Danger
>
> After some reflecting, I am realizing why I see some inconsistencies.
>
> Sometimes, I see this format
> $$V = R + \gamma P V$$
> $$V_{k}^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s' \in S}{P(s'|s, \pi(s))V_{k-1}^\pi(s')}$$
> so the reward $R$ seems to be separate, but in other equations I see this form
> $$v_{k+1}(s) = \sum_{a}\pi(a|s) \sum_{s',r}p(s',r|s,a)(r + \gamma v_k(s'))$$
>
> They are actually the same. The first is written for a deterministic policy, while the second is for a stochastic policy. The capitalized $R$ represents a [[notes/Function|Function]] (the expected reward), while a lowercase $r$ is a single reward value. These are all minor notation differences.

### [[notes/Principle of Optimality|Principle of Optimality]]

Any [[notes/Optimal Policy|Optimal Policy]] can be subdivided into two components:

1. An optimal first action $A_*$
2. Followed by an optimal policy from the successor state $S'$

> [!note] Theorem (Principle of Optimality)
>
> A policy $\pi(a|s)$ achieves the optimal value from state $s$, $v_\pi(s) = v_*(s)$, if and only if:
> - for every state $s'$ reachable from $s$, $\pi$ achieves the optimal value from state $s'$, i.e. $v_\pi(s') = v_*(s')$

### Bellman Optimality Equation

The Bellman Optimality Equation defines how the optimal value of a state is related to the optimal value of successor states. It has a "max" instead of an average.

The [[notes/Value Function#optimal-value-function|optimal value functions]] are recursively related by the Bellman optimality equations:

$$v_*(s) = \max_a \sum_{s',r} p(s',r |s,a)[r + \gamma v_*(s')]$$

For the action-value function:

$$q_*(s,a) = \sum_{s',r} p(s',r |s,a)[r + \gamma \max_{a'}q_*(s', a')]$$

Other notation:

$$Q_*(s,a) = R(s,a) + \gamma \sum\limits_{s' \in S} P_{ss'}^a \max_{a' \in A} Q_*(s', a')$$

#gap-in-knowledge Review the David Silver lectures; I don't understand the big picture: why does a one-step lookahead give you the optimal policy?

From page 86 of the RL book:

> A greedy policy is actually optimal in the long-term sense in which we are interested because $v_*$ already takes into account the reward consequences of all possible future behavior. By means of $v_*$, the optimal expected long-term return is turned into a quantity that is locally and immediately available for each state. Hence, a one-step-ahead search yields the long-term optimal actions.

#### Solving the Bellman Optimality Equation

The Bellman Optimality Equation is non-linear:

- There is no closed-form solution
- Instead, there are many iterative solution methods
    - [[notes/Value Iteration|Value Iteration]]
    - [[notes/Policy Iteration|Policy Iteration]]
    - [[notes/Q-Learning|Q-Learning]]
    - [[notes/Sarsa|Sarsa]]

### Difference between Bellman Expectation and Bellman Optimality

In summary, we have the **Bellman expectation backup** (used in [[notes/Policy Evaluation|Policy Evaluation]])

$$v^\pi_{k+1}(s) = \sum_{a}\pi(a|s) \sum_{s',r}p(s',r|s,a)(r + \gamma v_k^\pi(s'))$$

and the **Bellman optimality backup** (used in [[notes/Value Iteration|Value Iteration]]), where we don't follow a policy and instead simply take the maximizing action

$$v_{k+1}(s) = \max_{a} \sum_{s',r}p(s',r|s,a)[r + \gamma v_k(s')]$$

The difference is that we use the Bellman expectation backup to get the expected values of a particular policy $\pi$, whereas we use the Bellman optimality backup to get the optimal value function (these are the values we would obtain if we could somehow find an optimal policy $\pi$).
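To make the contrast concrete, here is a small sketch of the two backups side by side, under the same assumed tabular representation as the sketch above; here `pi[s][a]` is a stochastic policy $\pi(a|s)$.

```python
# The two backups, differing only in how they combine actions.
# Hypothetical representation: P[s][a] = [(prob, next_state, reward), ...],
# pi[s][a] = probability of taking action a in state s.

def expectation_backup(P, pi, V, s, gamma=0.9):
    """Bellman expectation backup: policy-weighted average over actions."""
    return sum(
        pi[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
        for a in range(len(P[s]))
    )

def optimality_backup(P, V, s, gamma=0.9):
    """Bellman optimality backup: max over actions instead of an average."""
    return max(
        sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
        for a in range(len(P[s]))
    )
```

Sweeping `expectation_backup` over all states is one iteration of [[notes/Policy Evaluation|Policy Evaluation]]; sweeping `optimality_backup` is one iteration of [[notes/Value Iteration|Value Iteration]]. The only structural change is replacing the sum weighted by $\pi(a|s)$ with a max, which is exactly the difference between the two equations above.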