Value Function Approximation

CS287 - Advanced Robotics

I think it's this idea that you hand-engineer your value function as a weighted sum of features.

Whereas before, we defined our value function in terms of a Bellman equation.

Example 1: Tetris

See the Value Iteration notes, where we show an implementation that combines these ideas.
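For Tetris, the classic approach (e.g., Bertsekas and Ioffe) scores a board as a weighted sum of hand-engineered features such as column heights, the maximum height, and the number of holes. A sketch of that idea; the board encoding and feature subset below are invented for illustration, not the actual tuned feature set:

```python
import numpy as np

def tetris_features(board):
    """board: 2D array of 0/1 (rows x cols), row 0 at the top.
    Returns a small illustrative feature vector."""
    board = np.asarray(board)
    rows, cols = board.shape
    # Height of each column: distance from the top-most filled cell to the bottom.
    heights = np.where(board.any(axis=0), rows - board.argmax(axis=0), 0)
    # A "hole" is an empty cell with a filled cell somewhere above it in its column.
    holes = sum(
        int(board[r, c] == 0 and board[:r, c].any())
        for r in range(rows) for c in range(cols)
    )
    return np.array([heights.max(), heights.sum(), holes, 1.0])

def v_hat(board, w):
    # Value = linear combination of hand-engineered features.
    return tetris_features(board) @ w
```

The weights w would be tuned (by hand or by learning) so that lower, hole-free boards score higher.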


So far we have represented the value function by a lookup table: every state s has an entry V(s), or every state-action pair (s, a) has an entry Q(s, a).

However, with large MDPs:

  • There are too many states and/or actions to store in memory
  • It is too slow to learn the value of each state individually

Real-world problems often have enormous state and/or action spaces, so the tabular representation is insufficient.

Solution for large MDPs:

  • Estimate the value function with function approximation, v̂(s, w) ≈ vπ(s)
  • Generalize from seen states to unseen states
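As a minimal sketch of the idea (the feature map here is invented for illustration): instead of storing one table entry per state, we store a small parameter vector w and compute values from features of the state.

```python
import numpy as np

n_states = 1_000_000  # a tabular V would need one entry per state

def features(s):
    """Illustrative hand-chosen feature map from a state index to a few numbers."""
    z = s / n_states
    return np.array([1.0, z, z ** 2, np.sin(z)])

def v_hat(s, w):
    """Approximate value: parameterized by w (4 numbers), not by n_states entries."""
    return features(s) @ w
```

States with similar features now get similar values, which is what lets learning generalize.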

David Silver: To update w, we use MC or TD Learning.

So what function approximators should we use? There are several options (linear combinations of features, neural networks, decision trees, nearest neighbors, ...), but we focus on differentiable function approximators:

Linear Feature Representations

Some thoughts afterwards on connections:

This is actually the same thing as the gradient descent case below, except that for a linear approximator ∇w v̂(S, w) = x(S), so the update simplifies to Δw = α (vπ(S) − v̂(S, w)) x(S).

Let x(S) = (x₁(S), …, xₙ(S))ᵀ be the feature vector.

Represent the value function by a linear combination of features: v̂(S, w) = x(S)ᵀ w = Σⱼ wⱼ xⱼ(S)

The objective function is quadratic in the parameters w: J(w) = 𝔼π[(vπ(S) − x(S)ᵀ w)²]

Stochastic gradient descent converges to the global optimum. The update rule is particularly simple: Δw = α (vπ(S) − v̂(S, w)) x(S), i.e. update = step-size × prediction error × feature value.
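That update rule as a one-line sketch (the feature vector and target come from elsewhere; the names are illustrative):

```python
import numpy as np

def linear_update(w, x, target, alpha=0.1):
    """One SGD step for a linear value function v_hat(S, w) = x(S) . w:
    delta_w = step-size * prediction error * feature value."""
    prediction = x @ w  # v_hat(S, w)
    return w + alpha * (target - prediction) * x
```

With an oracle, `target` would be the true value vπ(S); in practice it is one of the substitutes discussed next.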

Notice in the above how all of this works assuming that we have the true value function vπ(s), this "oracle" as David Silver calls it, which tells us how wrong we are at each state s. However, in RL there is no supervisor, only rewards; we don't have vπ(s). In practice, we substitute a target for vπ(s).

  • In MC, we replace vπ(Sₜ) with the return Gₜ
  • In TD(0), we replace vπ(Sₜ) with the TD target Rₜ₊₁ + γ v̂(Sₜ₊₁, w)
  • In TD(λ), we replace vπ(Sₜ) with the λ-return Gₜ^λ
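For instance, semi-gradient TD(0) with a linear approximator substitutes the bootstrapped TD target for the oracle value. A sketch, assuming the environment supplies the reward and next-state features:

```python
import numpy as np

def td0_update(w, x_s, x_s_next, reward, gamma=0.99, alpha=0.1, done=False):
    """Semi-gradient TD(0) for linear VFA: the oracle target v_pi(S)
    is replaced by the TD target R + gamma * v_hat(S', w)."""
    v_s = x_s @ w
    v_next = 0.0 if done else x_s_next @ w
    td_target = reward + gamma * v_next
    return w + alpha * (td_target - v_s) * x_s
```

Note the gradient is taken only through v̂(S, w), not through the target, which is why this is called "semi-gradient".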

Approximation by Gradient Descent

Goal: find parameter vector w minimizing mean-squared error between approximate value fn v̂(s, w) and true value fn vπ(s)

We defined the loss as the mean-squared error: J(w) = 𝔼π[(vπ(S) − v̂(S, w))²]

Gradient descent finds a local minimum: Δw = −½ α ∇w J(w). After applying the chain rule, we get Δw = α 𝔼π[(vπ(S) − v̂(S, w)) ∇w v̂(S, w)].

Instead of doing the full gradient update, we can sample the gradient with stochastic gradient descent: Δw = α (vπ(S) − v̂(S, w)) ∇w v̂(S, w).

Expected update is equal to the full gradient update.
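That claim can be checked numerically on illustrative random data: averaging the per-sample stochastic gradients over a dataset reproduces the full-batch gradient of the mean-squared error.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))   # feature vectors x(S)
v_true = rng.normal(size=50)   # "oracle" targets v_pi(S)
w = rng.normal(size=3)

# Full-batch gradient of J(w) = mean (v_pi(S) - x(S).w)^2  (up to the factor of 2).
errors = v_true - X @ w
full_grad = -(X * errors[:, None]).mean(axis=0)

# Mean of the per-sample (stochastic) gradients.
sample_grads = np.stack([-(v_true[i] - X[i] @ w) * X[i] for i in range(50)])
assert np.allclose(sample_grads.mean(axis=0), full_grad)
```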

TD Learning with VFA

TD is biased: the TD target is a biased sample of the true value vπ(Sₜ). We now have 3 forms of approximation:

  • function approximation
  • bootstrapping
  • sampling

VFA for Control

Policy evaluation: approximate policy evaluation, q̂(·, ·, w) ≈ qπ. Policy improvement: ε-greedy policy improvement.
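A minimal sketch of that control loop with a linear q̂(s, a, w); the feature map and environment are left abstract, and the function names are my own. Evaluation here uses a semi-gradient SARSA-style update, improvement is ε-greedy:

```python
import numpy as np

def q_hat(x_sa, w):
    # Linear action-value approximation: q(s, a) ~ x(s, a) . w
    return x_sa @ w

def epsilon_greedy(w, feature_fn, s, actions, eps=0.1, rng=None):
    """Policy improvement step: act greedily w.r.t. q_hat with prob 1 - eps."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < eps:
        return rng.choice(actions)
    values = [q_hat(feature_fn(s, a), w) for a in actions]
    return actions[int(np.argmax(values))]

def sarsa_update(w, x_sa, x_sa_next, reward, gamma=0.99, alpha=0.1):
    """Approximate policy evaluation step (semi-gradient SARSA)."""
    td_target = reward + gamma * q_hat(x_sa_next, w)
    return w + alpha * (td_target - q_hat(x_sa, w)) * x_sa
```

Alternating these two steps is the function-approximation analogue of generalized policy iteration.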