Value Function

The Value function is used to estimate how good it is for the agent to be in a given state.

Whereas the Reward indicates what is good in the immediate sense, the value function specifies what is good in the long run.

For an MRP, the value function $v(s)$ gives the long-term value of state $s$. For an MDP, value functions are defined in terms of a Policy.

Definition

The state-value function $V^{π}(s)$ of an MDP is the expected Return starting from state $s$ and then following policy $π$:

$V^{π}(s) = E_{π}[G_{t} ∣ S_{t} = s]$

Definition

The action-value function $Q^{π}(s, a)$ of an MDP is the expected Return starting from state $s$, taking action $a$, and then following policy $π$:

$Q^{π}(s, a) = E_{π}[G_{t} ∣ S_{t} = s, A_{t} = a]$

But we can do better. Because we assume the problem is an MDP, the current state captures all the information needed to predict the future (the Markov property), so the value does not depend on the history of past states and rewards. This lets us break the value function into two parts, a decomposition known as the Bellman Equation:

$\text{Value} = \text{immediate reward} + \text{discounted sum of future rewards (future value)}$
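To make the split concrete, here is a minimal sketch that computes a discounted return for a short reward sequence and checks that it equals the immediate reward plus the discounted return of the rest. The reward values, the discount of 0.5, and the helper name `discounted_return` are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Return G_t = R_{t+1} + gamma * R_{t+2} + ... for a finite reward list."""
    g = 0.0
    # Work backwards so each step applies G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 0.0, 2.0]  # made-up rewards R_{t+1}, R_{t+2}, R_{t+3}
gamma = 0.5                # made-up discount factor

full = discounted_return(rewards, gamma)
# Same quantity, split into immediate reward + discounted future return.
split = rewards[0] + gamma * discounted_return(rewards[1:], gamma)

print(full, split)
```

Both numbers agree, which is the two-part decomposition applied to one sampled trajectory; the value function is the expectation of this quantity.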

Important

The value function of the terminal state is always $v(s_{\text{terminal}}) = 0$

Because it is an MDP, the value function can be decomposed into two parts: the immediate reward $R_{t + 1}$ and the discounted value of the successor state $γ v(S_{t + 1})$.

$v_{π}(s) = E_{π}[G_{t} ∣ S_{t} = s] = E_{π}[R_{t + 1} + γ v_{π}(S_{t + 1}) ∣ S_{t} = s] = \sum_{a} π(a ∣ s) \sum_{s^{'}, r} p(s^{'}, r ∣ s, a) [r + γ v_{π}(s^{'})]$
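The double sum on the right-hand side can be read as two nested loops: over actions weighted by $π(a ∣ s)$, then over outcomes weighted by $p(s^{'}, r ∣ s, a)$. Below is a minimal sketch of one such backup at a single state. The dictionary layout, the function name `bellman_backup`, and the tiny two-state example are all assumptions for illustration.

```python
def bellman_backup(s, v, policy, dynamics, gamma):
    """One application of the Bellman expectation equation at state s.

    policy[s][a]      -> pi(a | s)
    dynamics[(s, a)]  -> list of (prob, next_state, reward), i.e. p(s', r | s, a)
    """
    total = 0.0
    for a, pi_a in policy[s].items():              # outer sum over actions
        for prob, s_next, r in dynamics[(s, a)]:   # inner sum over (s', r)
            total += pi_a * prob * (r + gamma * v[s_next])
    return total


# Hypothetical example; "end" is terminal, so v["end"] = 0.
policy = {"start": {"left": 0.5, "right": 0.5}}
dynamics = {
    ("start", "left"): [(1.0, "end", 0.0)],
    ("start", "right"): [(0.5, "end", 2.0), (0.5, "start", 0.0)],
}
v = {"start": 1.0, "end": 0.0}  # an arbitrary current estimate, not the true v_pi

backed_up = bellman_backup("start", v, policy, dynamics, gamma=0.5)
print(backed_up)
```

Note that `v` here is just a starting guess, so the result is one backup of that guess rather than the true value; the true $v_{π}$ is the fixed point of this operation.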

We usually solve for the value function with an iterative algorithm:

Initialize $V_{0}^{π}(s) = 0$ for all $s$
Start at $k = 1$ and continue until convergence (i.e. $∣ V_{k}^{π} - V_{k - 1}^{π} ∣ < ϵ$)
$\forall s \in S$, $V_{k}^{π}(s) = R(s, π(s)) + γ \sum_{s^{'} \in S} P(s^{'} ∣ s, π(s)) V_{k - 1}^{π}(s^{'})$

The update above is called a Bellman expectation backup for a particular policy. It is basically the same equation as the iterative algorithm to solve for a Markov Reward Process; we just plug in the policy, since MDP = MRP + policy.
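The iteration can be sketched in a few lines of Python for a deterministic policy $π(s)$. The dictionary-based representation of `R` and `P`, the `max_iters` safeguard, and the 2-state MDP are assumptions chosen to keep the example small. The terminal state is modeled as absorbing with zero reward so that its value stays at 0.

```python
def policy_evaluation(states, policy, R, P, gamma, eps=1e-8, max_iters=10_000):
    """Iterate V_k(s) = R(s, pi(s)) + gamma * sum_s' P(s' | s, pi(s)) * V_{k-1}(s')."""
    v = {s: 0.0 for s in states}  # V_0(s) = 0 for all s
    for _ in range(max_iters):
        v_new = {}
        for s in states:
            a = policy[s]
            v_new[s] = R[(s, a)] + gamma * sum(
                p * v[s_next] for s_next, p in P[(s, a)].items()
            )
        # Stop when the largest change over all states is below eps.
        delta = max(abs(v_new[s] - v[s]) for s in states)
        v = v_new
        if delta < eps:
            break
    return v


# Hypothetical 2-state MDP: "A" is a normal state, "T" is terminal (absorbing).
states = ["A", "T"]
policy = {"A": "go", "T": "stay"}
R = {("A", "go"): 1.0, ("T", "stay"): 0.0}
P = {
    ("A", "go"): {"A": 0.5, "T": 0.5},
    ("T", "stay"): {"T": 1.0},
}

v = policy_evaluation(states, policy, R, P, gamma=0.9)
print(v)
```

For this toy problem the fixed point can be checked by hand: $V(A) = 1 + 0.9 \cdot 0.5 \cdot V(A)$, so $V(A) = 1 / 0.55 ≈ 1.818$, and the loop should land close to that.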

We use this in Policy Evaluation.

Computing the Value Function

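We can also estimate the value function of a particular state by running a bunch of simulations from that state and averaging the returns. This uses the definition $V^{π}(s) = E_{π}[G_{t} ∣ S_{t} = s]$ directly (the one you don't usually compute with) instead of the Bellman equation.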
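A minimal sketch of the simulate-and-average idea, using a made-up one-state process: every step gives reward 1, then the episode continues with probability 0.5 and otherwise terminates. With $γ = 0.9$ the true value satisfies $V = 1 + 0.9 \cdot 0.5 \cdot V$, i.e. $V = 1 / 0.55 ≈ 1.818$, so the estimate can be sanity-checked. The function names and the dynamics are assumptions for illustration.

```python
import random


def sample_return(rng, gamma, p_continue=0.5):
    """Simulate one episode and return its discounted return.

    Hypothetical dynamics: every step gives reward 1, then the episode
    continues with probability p_continue and otherwise terminates.
    """
    g, discount = 0.0, 1.0
    while True:
        g += discount * 1.0       # add the discounted reward for this step
        discount *= gamma
        if rng.random() >= p_continue:
            return g              # episode terminated


def mc_value_estimate(n_episodes, gamma, seed=0):
    """Average the sampled returns over n_episodes simulated episodes."""
    rng = random.Random(seed)     # seeded so the run is reproducible
    returns = [sample_return(rng, gamma) for _ in range(n_episodes)]
    return sum(returns) / n_episodes


estimate = mc_value_estimate(n_episodes=20_000, gamma=0.9)
print(estimate)
```

The estimate is noisy and only approaches the true value as the number of episodes grows, which is the trade-off against the iterative Bellman backup.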

State-Value vs. Action-Value Function

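Maybe I talked about this somewhere else, but in Model-Free Control you ABSOLUTELY need to use the action-value function $Q(s, a)$. It will not work with $V(s)$, because greedy policy improvement over $V(s)$ needs a one-step lookahead through the model (the rewards and transition probabilities) to get the value of each successor state, and in the model-free setting we don't have the model. With $Q(s, a)$, the greedy policy is just $\arg\max_{a} Q(s, a)$. See Generalized Policy Iteration.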
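The difference shows up directly in the function signatures. In this rough sketch, acting greedily from $Q$ needs only the table of Q-values, while acting greedily from $V$ also needs `R` and `P`, which a model-free agent does not have. All names and numbers here are hypothetical.

```python
def greedy_from_q(Q, s, actions):
    """Model-free: pick the best action using only the learned Q-values."""
    return max(actions, key=lambda a: Q[(s, a)])


def greedy_from_v(V, s, actions, R, P, gamma):
    """Model-based: needs R and P to look one step ahead from V."""
    def one_step(a):
        return R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
    return max(actions, key=one_step)


# Hypothetical numbers for a single state "s0" with two actions.
actions = ["left", "right"]
Q = {("s0", "left"): 0.2, ("s0", "right"): 0.7}

V = {"s1": 0.0, "s2": 1.0}
R = {("s0", "left"): 0.2, ("s0", "right"): 0.0}            # model: rewards
P = {("s0", "left"): {"s1": 1.0}, ("s0", "right"): {"s2": 1.0}}  # model: transitions

a_q = greedy_from_q(Q, "s0", actions)
a_v = greedy_from_v(V, "s0", actions, R, P, gamma=0.7)
print(a_q, a_v)
```

Both calls pick the same action here because the made-up `Q` was chosen to match the one-step lookahead on `V`; the point is only that the second call cannot be made without `R` and `P`.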

Optimal Value Function

🛠️ Steven Gong
