# Value Function Approximation

### CS287 - Advanced Robotics

The idea is to hand-engineer the value function as a weighted sum of features, with a learned weight vector $\theta$.

Whereas before, we defined our value function exactly, in terms of the Bellman equation.

Example 1: Tetris. The set of possible boards is astronomically large, but the value of a board can be approximated well from a small number of hand-designed features.

See the Value Iteration notes, where we show an implementation combining value iteration with function approximation.
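To make the Tetris example concrete, here is a minimal sketch of a hand-engineered linear value function. The particular features (column heights, height differences, holes) are a common choice in the Tetris literature and are an assumption here, not necessarily the exact set used in lecture:

```python
import numpy as np

def tetris_features(board):
    """Hand-engineered features for a Tetris board (binary occupancy grid,
    row 0 = top). Feature choice is illustrative, not the lecture's exact set."""
    rows, cols = board.shape
    # Height of each column: rows minus the index of the first filled cell.
    heights = np.array([
        rows - np.argmax(board[:, c]) if board[:, c].any() else 0
        for c in range(cols)
    ])
    # Holes: empty cells with at least one filled cell above them.
    holes = sum(
        int(board[r, c] == 0 and board[:r, c].any())
        for r in range(rows) for c in range(cols)
    )
    # Features: heights, adjacent height differences, hole count, bias term.
    return np.concatenate([heights, np.abs(np.diff(heights)), [holes, 1.0]])

def value(board, theta):
    # Linear value estimate: V(s) = theta . x(s)
    return tetris_features(board) @ theta
```

With this representation, learning a value function reduces to fitting the weight vector `theta`.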

### Intuition

So far we have represented the value function by a lookup table: every state $s$ has an entry $V(s)$ in the table, or every state-action pair $(s,a)$ has an entry $Q(s,a)$ in the table.

However, with large MDPs:

- There are too many states and/or actions to store in memory
- It is too slow to learn the value of each state individually

Real-world problems often have enormous state and/or action spaces, so the **tabular representation is insufficient**.

Solution for large MDPs:

- Estimate the value function with function approximation: $\hat{v}(s,w) \approx v_{\pi}(s)$ and $\hat{q}(s,a,w) \approx q_{\pi}(s,a)$

Following David Silver: to update $w$, we use MC or TD learning.

So which function approximators should we use? There are several options, but we focus on differentiable function approximators:

- Linear Feature Representations
- Neural Networks

### Linear Feature Representations

A connection worth noting up front: the linear case is the same as the gradient-descent case below, except that the general update $\Delta w=\alpha\,(v_{\pi}(S)-\hat{v}(S,w))\,\nabla_{w}\hat{v}(S,w)$ simplifies to $\Delta w=\alpha\,(v_{\pi}(S)-\hat{v}(S,w))\,x(S)$.

Let $x(S)$ be the feature vector, $x(S)=(x_{1}(S),\dots,x_{n}(S))^{\top}$.

Represent the value function by a linear combination of features: $\hat{v}(S,w)=x(S)^{\top}w=\sum_{j=1}^{n}x_{j}(S)\,w_{j}$

The objective function is quadratic in the parameters $w$: $J(w)=\mathbb{E}_{\pi}\left[\left(v_{\pi}(S)-x(S)^{\top}w\right)^{2}\right]$

Stochastic gradient descent therefore converges to the global optimum. The update rule is particularly simple: since $\nabla_{w}\hat{v}(S,w)=x(S)$, we get $\Delta w=\alpha\,(v_{\pi}(S)-\hat{v}(S,w))\,x(S)$, i.e. update = step-size × prediction error × feature value.
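The linear update is a one-liner in code. A minimal sketch, where `target` stands in for $v_{\pi}(S)$ (in practice a Monte Carlo return or TD target, as discussed below):

```python
import numpy as np

def linear_vfa_update(w, x, target, alpha):
    """One SGD step for linear value function approximation.

    w      : weight vector
    x      : feature vector x(S)
    target : stand-in for v_pi(S) (e.g. a Monte Carlo return)
    alpha  : step size

    Update = step-size * prediction error * feature value.
    """
    prediction = x @ w
    return w + alpha * (target - prediction) * x
```

Repeatedly applying the update on a fixed sample drives the prediction $x^{\top}w$ toward the target.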

Notice that all of the above assumes access to the true value function $v_{\pi}(s)$ (the "oracle", as David Silver calls it), which supplies the error signal in $J(w)$. However, in RL there is no supervisor, only rewards: we don't have $v_{\pi}(s)$. In practice, we substitute a target for $v_{\pi}(s)$:

- In MC, we replace $v_{\pi}(s)$ with the return $G_{t}$
- In TD(0), we replace $v_{\pi}(s)$ with the TD target $R_{t+1}+\gamma\,\hat{v}(S_{t+1},w)$
- In TD($\lambda$), we replace $v_{\pi}(s)$ with the $\lambda$-return

### Approximation by Gradient Descent

**Goal**: find the parameter vector $w$ minimizing the mean-squared error between the approximate value function $\hat{v}(s,w)$ and the true value function $v_{\pi}(s)$.

We define the loss as $J(w)=\mathbb{E}_{\pi}\left[\left(v_{\pi}(S)-\hat{v}(S,w)\right)^{2}\right]$

Gradient descent finds a local minimum: $\Delta w=-\frac{1}{2}\alpha\,\nabla_{w}J(w)$. Applying the chain rule, we get $\Delta w=\alpha\,\mathbb{E}_{\pi}\left[\left(v_{\pi}(S)-\hat{v}(S,w)\right)\nabla_{w}\hat{v}(S,w)\right]$

Instead of computing the full gradient, we can sample it with stochastic gradient descent.

$\Delta w=\alpha\,(v_{\pi}(S)-\hat{v}(S,w))\,\nabla_{w}\hat{v}(S,w)$. The expected update is equal to the full-gradient update.
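The claim that the expected per-sample update equals the full-gradient update is easy to check numerically in the linear case: averaging the SGD updates over a dataset of states reproduces the batch update exactly. The random "dataset" below is an illustrative assumption:

```python
import numpy as np

# Check that the expected SGD update equals the full-gradient update
# for J(w) = E[(v_pi(S) - x(S)^T w)^2], estimated over a finite sample.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))   # feature vectors x(S) for 100 sampled states
v_true = rng.normal(size=100)   # stand-in for v_pi(S) at those states
w = rng.normal(size=3)
alpha = 0.01

# Full-gradient update: alpha * E[(v_pi(S) - x^T w) x]
full = alpha * np.mean((v_true - X @ w)[:, None] * X, axis=0)
# Mean of the individual per-sample SGD updates:
sgd_mean = np.mean(
    [alpha * (v_true[i] - X[i] @ w) * X[i] for i in range(100)], axis=0
)
assert np.allclose(full, sgd_mean)
```

This is exactly why SGD is an unbiased, cheaper substitute for batch gradient descent here.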

### TD Learning with VFA

The TD target is a biased estimate of the true value. TD learning with VFA combines three forms of approximation:

- function approximation
- bootstrapping
- sampling

#### VFA for Control

Policy evaluation becomes approximate policy evaluation: fit $\hat{q}(s,a,w)\approx q_{\pi}(s,a)$, then improve the policy with respect to $\hat{q}$ (e.g. $\varepsilon$-greedily), and interleave the two.
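A minimal sketch of this control loop is semi-gradient SARSA(0) with a linear $\hat{q}$. The environment interface (`reset()` returning a state, `step(a)` returning `(next_state, reward, done)`) and the `featurize` function are assumptions here, not part of the lecture:

```python
import numpy as np

def sarsa_linear(env, featurize, n_actions, episodes=500,
                 alpha=0.1, gamma=0.99, eps=0.1, seed=0):
    """Semi-gradient SARSA(0) with linear q(s, a, w) = w[a] . x(s).

    `env` is assumed to expose reset()/step(a) -> (s', r, done);
    `featurize` maps a state to a feature vector. Both are assumptions.
    """
    rng = np.random.default_rng(seed)
    w = None

    def q(x):          # q-values for all actions, given features x
        return w @ x

    def policy(x):     # epsilon-greedy improvement w.r.t. current q-hat
        if rng.random() < eps:
            return int(rng.integers(n_actions))
        return int(np.argmax(q(x)))

    for _ in range(episodes):
        s = env.reset()
        x = featurize(s)
        if w is None:
            w = np.zeros((n_actions, x.size))
        a = policy(x)
        done = False
        while not done:
            s2, r, done = env.step(a)
            x2 = featurize(s2)
            a2 = policy(x2)
            target = r if done else r + gamma * q(x2)[a2]
            # Approximate policy evaluation step for the current policy:
            w[a] += alpha * (target - q(x)[a]) * x
            x, a = x2, a2
    return w
```

Evaluation (the `w[a]` update toward the SARSA target) and improvement (acting $\varepsilon$-greedily on the updated $\hat{q}$) alternate every step.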