# Value Function

The Value function is used to estimate how good it is for the agent to be in a given state.

It is the quantity computed during Policy Evaluation.

Whereas the Reward indicates what is good in the immediate sense, the value function specifies what is good in the long run.

For an MRP, the value function $v(s)$ gives the long-term value of state $s$. For an MDP, value functions are defined with respect to a Policy.

Definition

The state-value function $v_{\pi}(s)$ of an MDP is the expected Return starting from state $s$ and then following policy $\pi$:

$v_{\pi}(s)=\mathbb{E}_{\pi}[G_{t} \mid S_{t}=s]$

Definition

The action-value function $q_{\pi}(s,a)$ of an MDP is the expected Return starting from state $s$, taking action $a$, and then following policy $\pi$:

$q_{\pi}(s,a)=\mathbb{E}_{\pi}[G_{t} \mid S_{t}=s,A_{t}=a]$

But we can do better than working with these definitions directly. Because we assume the problem is an MDP, the current state captures all relevant information (the future does not depend on past states or rewards). This lets us break the value function into two parts, a decomposition called the Bellman Equation:

$\text{Value} = \text{immediate reward} + \text{discounted sum of future rewards (future value)}$
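As a short worked step, here is where this split comes from. It assumes the usual discounted Return $G_{t}$, whose definition is not spelled out in this note:

$G_{t} = R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \dots = R_{t+1} + \gamma G_{t+1}$

$v_{\pi}(s) = \mathbb{E}_{\pi}[R_{t+1} + \gamma G_{t+1} \mid S_{t}=s] = \mathbb{E}_{\pi}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_{t}=s]$

The first term is the immediate reward and the second is the discounted value of wherever we land next.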

These relationships are usually drawn as Backup Diagrams. For the state-value function, the probability (the weight of each edge) is given by the policy, which we decide:

$v_{\pi}(s)=\sum_{a}\pi(a \mid s)\,q_{\pi}(s,a)$

For the q-value (i.e. after we have taken an action), the outcome is determined by the environment. After choosing an action we get a reward, and the probability of landing in each new state is given by the environment. We also apply the Discount Factor $\gamma$ here:

$q_{\pi}(s,a)=\sum_{s',r}p(s',r \mid s,a)\,(r+\gamma v_{\pi}(s'))$

Substituting each equation into the other gives a recursive relationship:

$v_{\pi}(s)=\sum_{a}\pi(a \mid s)\sum_{s',r}p(s',r \mid s,a)\,(r+\gamma v_{\pi}(s'))$

$q_{\pi}(s,a)=\sum_{s',r}p(s',r \mid s,a)\,\bigl(r+\gamma\sum_{a'}\pi(a' \mid s')\,q_{\pi}(s',a')\bigr)$

We use the Bellman Equation to calculate the value function.

Important

The value function of the terminal state is always $v(s_{\text{terminal}})=0$.
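The recursive Bellman relationship can be turned into a repeated update, which is one common way to calculate $v_{\pi}$. Below is a minimal sketch of that idea on a made-up 3-state MDP. The transition table `P`, the uniform policy `pi`, and the helper names `q_from_v` and `policy_evaluation` are all assumptions for illustration, not something defined elsewhere in these notes.

```python
# Hypothetical MDP: states 0 and 1 are non-terminal, state 2 is terminal.
# P[s][a] is a list of (probability, next_state, reward), i.e. p(s', r | s, a).
P = {
    0: {0: [(1.0, 1, 0.0)], 1: [(1.0, 2, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 2, 2.0)]},
}
N_STATES = 3
GAMMA = 0.9

# pi[s][a] = pi(a | s); here an assumed uniform random policy.
pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}


def q_from_v(v, s, a, gamma=GAMMA):
    """q(s, a) = sum over (s', r) of p(s', r | s, a) * (r + gamma * v(s'))."""
    return sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])


def policy_evaluation(pi, gamma=GAMMA, tol=1e-10):
    """Sweep v(s) = sum_a pi(a | s) * q(s, a) until the values stop changing."""
    v = [0.0] * N_STATES  # the terminal state is never updated, so it stays 0
    while True:
        delta = 0.0
        for s in P:  # only non-terminal states appear in P
            new_v = sum(pi[s][a] * q_from_v(v, s, a, gamma) for a in P[s])
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v


v_pi = policy_evaluation(pi)
print(v_pi)
```

The inner sum uses the environment's probabilities and the outer sum uses the policy's, mirroring the two layers of the backup diagram.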

### Computing the Value Function

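We can also run many simulated episodes from a particular state and average the observed returns to estimate its value. This uses the expectation definition $v_{\pi}(s)=\mathbb{E}_{\pi}[G_{t} \mid S_{t}=s]$ directly, rather than the Bellman Equation we would usually use.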
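A rough sketch of this simulate-and-average approach is given below. The toy environment in `step`, the uniform random policy, and the names `sample_return` and `mc_value` are all hypothetical and chosen only to keep the example self-contained.

```python
import random

GAMMA = 0.9
TERMINAL = 2


def step(state, action):
    """Hypothetical deterministic environment: returns (next_state, reward)."""
    if action == 1:  # action 1 ends the episode with a state-dependent reward
        return TERMINAL, (1.0 if state == 0 else 2.0)
    return 1 - state, 0.0  # action 0 hops between states 0 and 1 for no reward


def sample_return(start, rng):
    """Roll out one episode under a uniform random policy and return G_t."""
    g, discount, s = 0.0, 1.0, start
    while s != TERMINAL:
        a = rng.choice([0, 1])
        s, r = step(s, a)
        g += discount * r
        discount *= GAMMA
    return g


def mc_value(start, episodes=20000, seed=0):
    """Average the sampled returns to estimate v(start)."""
    rng = random.Random(seed)
    total = sum(sample_return(start, rng) for _ in range(episodes))
    return total / episodes


v0_estimate = mc_value(0)
print(v0_estimate)
```

No transition probabilities are used anywhere in the estimate, only sampled episodes, so the same idea works when the environment's dynamics are unknown.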

### State-Value vs. Action-Value Function

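In Model-Free Control, you need the action-value function $Q(s,a)$; $V(s)$ alone will not work. Acting greedily with respect to $V(s)$ requires a one-step lookahead over the next states, which depends on the environment dynamics $p(s',r \mid s,a)$ that we do not have. Acting greedily with respect to $Q(s,a)$ only needs $\arg\max_{a} Q(s,a)$. See Generalized Policy Iteration.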
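The difference shows up when picking a greedy action. In the sketch below, the tables `Q`, `V`, and `MODEL` hold made-up illustrative numbers, and the function names are hypothetical. The point is only which inputs each function needs.

```python
GAMMA = 0.9

# Hypothetical action values Q[s][a] and state values V[s] (state 2 is terminal).
Q = {0: {0: 1.38, 1: 1.0}, 1: {0: 1.07, 1: 2.0}}
V = {0: 1.19, 1: 1.54, 2: 0.0}

# Hypothetical model: MODEL[s][a] = list of (probability, next_state, reward).
# In a model-free setting this table is exactly what we do NOT have.
MODEL = {
    0: {0: [(1.0, 1, 0.0)], 1: [(1.0, 2, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 2, 2.0)]},
}


def greedy_from_q(Q, s):
    """Model-free: the greedy action is just the argmax over Q[s]."""
    return max(Q[s], key=Q[s].get)


def greedy_from_v(V, s, model, gamma=GAMMA):
    """Needs the model: one-step lookahead r + gamma * V[s'] for each action."""
    def lookahead(a):
        return sum(p * (r + gamma * V[s2]) for p, s2, r in model[s][a])
    return max(model[s], key=lookahead)


print(greedy_from_q(Q, 0), greedy_from_v(V, 0, MODEL))
print(greedy_from_q(Q, 1), greedy_from_v(V, 1, MODEL))
```

`greedy_from_v` cannot be written without the `model` argument, which is why control from $V(s)$ is tied to knowing the environment.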