Value Function Approximation

CS287 - Advanced Robotics

I think it's this idea that you hand-engineer your value function as a weighted sum of features.

Whereas before, we defined our value function in terms of a Bellman equation.

Example 1: Tetris

See the Value Iteration notes, where we show an implementation that combines these ideas.
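For Tetris, the classic approach (e.g., Bertsekas and Ioffe) scores a board as a weighted sum of hand-engineered features such as column heights, the maximum height, and the number of holes. A sketch of that idea; the board encoding and feature subset below are invented for illustration, not the actual tuned feature set:

```python
import numpy as np

def tetris_features(board):
    """board: 2D array of 0/1 (rows x cols), row 0 at the top.
    Returns a small illustrative feature vector."""
    board = np.asarray(board)
    rows, cols = board.shape
    # Height of each column: distance from the top-most filled cell to the bottom.
    heights = np.where(board.any(axis=0), rows - board.argmax(axis=0), 0)
    # A "hole" is an empty cell with a filled cell somewhere above it in its column.
    holes = sum(
        int(board[r, c] == 0 and board[:r, c].any())
        for r in range(rows) for c in range(cols)
    )
    return np.array([heights.max(), heights.sum(), holes, 1.0])

def v_hat(board, w):
    # Value = linear combination of hand-engineered features.
    return tetris_features(board) @ w
```

The weights w would be tuned (by hand or by learning) so that lower, hole-free boards score higher.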


So far we have represented the value function by a lookup table: every state s has an entry V(s), or every state-action pair (s, a) has an entry Q(s, a).

However, with large MDPs:

  • There are too many states and/or actions to store in memory
  • It is too slow to learn the value of each state individually

Real-world problems often have enormous state and/or action spaces, so the tabular representation is insufficient.

Solution for large MDPs:

  • Estimate the value function with function approximation, v̂(s, w) ≈ vπ(s)
  • Generalize from seen states to unseen states
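As a minimal sketch of the idea (the feature map here is invented for illustration): instead of storing one table entry per state, we store a small parameter vector w and compute values from features of the state.

```python
import numpy as np

n_states = 1_000_000  # a tabular V would need one entry per state

def features(s):
    """Illustrative hand-chosen feature map from a state index to a few numbers."""
    z = s / n_states
    return np.array([1.0, z, z ** 2, np.sin(z)])

def v_hat(s, w):
    """Approximate value: parameterized by w (4 numbers), not by n_states entries."""
    return features(s) @ w
```

States with similar features now get similar values, which is what lets learning generalize.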

David Silver: To update w, we use MC or TD Learning.

So what function approximators should we use? There are several options (linear combinations of features, neural networks, decision trees, nearest neighbors, ...), but we focus on differentiable function approximators:

Linear Feature Representations

Some thoughts afterwards on connections:

This is actually the same thing as the gradient descent case below, except that for a linear approximator ∇w v̂(S, w) = x(S), so the update simplifies to Δw = α (vπ(S) − v̂(S, w)) x(S).

Let x(S) = (x₁(S), …, xₙ(S))ᵀ be the feature vector.

Represent the value function by a linear combination of features: v̂(S, w) = x(S)ᵀ w = Σⱼ wⱼ xⱼ(S)

The objective function is quadratic in the parameters w: J(w) = 𝔼π[(vπ(S) − x(S)ᵀ w)²]

Stochastic gradient descent converges to the global optimum. The update rule is particularly simple: Δw = α (vπ(S) − v̂(S, w)) x(S), i.e. update = step-size × prediction error × feature value.
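That update rule as a one-line sketch (the feature vector and target come from elsewhere; the names are illustrative):

```python
import numpy as np

def linear_update(w, x, target, alpha=0.1):
    """One SGD step for a linear value function v_hat(S, w) = x(S) . w:
    delta_w = step-size * prediction error * feature value."""
    prediction = x @ w  # v_hat(S, w)
    return w + alpha * (target - prediction) * x
```

With an oracle, `target` would be the true value vπ(S); in practice it is one of the substitutes discussed next.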

Notice in the above how all of this works assuming that we have the true value function vπ(s), this "oracle" as David Silver calls it, which tells us how wrong we are at each state s. However, in RL there is no supervisor, only rewards; we don't have vπ(s). In practice, we substitute a target for vπ(s).

  • In MC, we replace vπ(Sₜ) with the return Gₜ
  • In TD(0), we replace vπ(Sₜ) with the TD target Rₜ₊₁ + γ v̂(Sₜ₊₁, w)
  • In TD(λ), we replace vπ(Sₜ) with the λ-return Gₜ^λ
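For instance, semi-gradient TD(0) with a linear approximator substitutes the bootstrapped TD target for the oracle value. A sketch, assuming the environment supplies the reward and next-state features:

```python
import numpy as np

def td0_update(w, x_s, x_s_next, reward, gamma=0.99, alpha=0.1, done=False):
    """Semi-gradient TD(0) for linear VFA: the oracle target v_pi(S)
    is replaced by the TD target R + gamma * v_hat(S', w)."""
    v_s = x_s @ w
    v_next = 0.0 if done else x_s_next @ w
    td_target = reward + gamma * v_next
    return w + alpha * (td_target - v_s) * x_s
```

Note the gradient is taken only through v̂(S, w), not through the target, which is why this is called "semi-gradient".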

Approximation by Gradient Descent

Goal: find parameter vector w minimizing mean-squared error between approximate value fn v̂(s, w) and true value fn vπ(s)

We defined the loss as the mean-squared error: J(w) = 𝔼π[(vπ(S) − v̂(S, w))²]

Gradient descent finds a local minimum: Δw = −½ α ∇w J(w). After applying the chain rule, we get Δw = α 𝔼π[(vπ(S) − v̂(S, w)) ∇w v̂(S, w)].

Instead of doing the full gradient update, we can sample the gradient with stochastic gradient descent: Δw = α (vπ(S) − v̂(S, w)) ∇w v̂(S, w).

Expected update is equal to the full gradient update.
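That claim can be checked numerically on illustrative random data: averaging the per-sample stochastic gradients over a dataset reproduces the full-batch gradient of the mean-squared error.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))   # feature vectors x(S)
v_true = rng.normal(size=50)   # "oracle" targets v_pi(S)
w = rng.normal(size=3)

# Full-batch gradient of J(w) = mean (v_pi(S) - x(S).w)^2  (up to the factor of 2).
errors = v_true - X @ w
full_grad = -(X * errors[:, None]).mean(axis=0)

# Mean of the per-sample (stochastic) gradients.
sample_grads = np.stack([-(v_true[i] - X[i] @ w) * X[i] for i in range(50)])
assert np.allclose(sample_grads.mean(axis=0), full_grad)
```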

TD Learning with VFA

TD is biased: the TD target is a biased sample of the true value vπ(Sₜ). We now have 3 forms of approximation:

  • function approximation
  • bootstrapping
  • sampling

VFA for Control

Policy evaluation: approximate policy evaluation, q̂(·, ·, w) ≈ qπ. Policy improvement: ε-greedy policy improvement.
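A minimal sketch of that control loop with a linear q̂(s, a, w); the feature map and environment are left abstract, and the function names are my own. Evaluation here uses a semi-gradient SARSA-style update, improvement is ε-greedy:

```python
import numpy as np

def q_hat(x_sa, w):
    # Linear action-value approximation: q(s, a) ~ x(s, a) . w
    return x_sa @ w

def epsilon_greedy(w, feature_fn, s, actions, eps=0.1, rng=None):
    """Policy improvement step: act greedily w.r.t. q_hat with prob 1 - eps."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < eps:
        return rng.choice(actions)
    values = [q_hat(feature_fn(s, a), w) for a in actions]
    return actions[int(np.argmax(values))]

def sarsa_update(w, x_sa, x_sa_next, reward, gamma=0.99, alpha=0.1):
    """Approximate policy evaluation step (semi-gradient SARSA)."""
    td_target = reward + gamma * q_hat(x_sa_next, w)
    return w + alpha * (td_target - q_hat(x_sa, w)) * x_sa
```

Alternating these two steps is the function-approximation analogue of generalized policy iteration.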