Value Function Approximation
CS287 - Advanced Robotics
I think it's this idea that you hand-engineer your value function as a sum of features.
Whereas before, we defined our value function in terms of a Bellman equation.
Example 1: Tetris
See Value Iteration, where we show the implementation that combines these ideas.
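To make "value as a sum of hand-engineered features" concrete, here is a minimal sketch for a Tetris board. The specific features (column heights, holes, bumpiness) and the board convention (row 0 at the top, 1 = filled) are illustrative assumptions, not necessarily the ones used in the lecture.

```python
import numpy as np

def tetris_features(board: np.ndarray) -> np.ndarray:
    """Hand-engineered features for a binary Tetris board (rows x cols)."""
    rows, cols = board.shape
    # Column height = number of cells from the topmost filled cell down to the floor.
    heights = np.array([rows - np.argmax(board[:, c]) if board[:, c].any() else 0
                        for c in range(cols)])
    # A hole is an empty cell with at least one filled cell above it in the same column.
    holes = sum(int(np.sum(board[rows - heights[c]:, c] == 0)) for c in range(cols))
    # Bumpiness = sum of absolute height differences between adjacent columns.
    bumpiness = int(np.sum(np.abs(np.diff(heights))))
    return np.concatenate([heights, [holes, bumpiness, 1.0]])  # trailing 1.0 acts as a bias

def value(board: np.ndarray, w: np.ndarray) -> float:
    """Linear value estimate: a weighted sum of the hand-engineered features."""
    return float(tetris_features(board) @ w)
```

With this representation, learning a value function for Tetris reduces to learning the weight vector `w` instead of a table over an astronomically large set of board configurations.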
Intuition
So far we have represented the value function by a lookup table, i.e., every state (or every state-action pair) has its own entry in the table.
However, with large MDPs:
- There are too many states and/or actions to store in memory
- It is too slow to learn the value of each state individually
Real-world problems often have enormous state and/or action spaces, so the tabular representation is insufficient.
Solution for large MDPs:
- Estimate value function with function approximation
David Silver: To update w, we use MC or TD Learning.
So which function approximators should we use? There are several options, but we focus on differentiable function approximators (see the neural-network sketch after this list; linear features are developed in the next section):
- Linear Feature Representations
- Neural Networks
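For contrast with the linear case developed below, here is a minimal sketch of a neural-network value function $\hat{v}(s, w)$ in PyTorch; the state dimension and layer sizes are arbitrary assumptions made for illustration.

```python
import torch
import torch.nn as nn

# Minimal sketch: a small MLP mapping a state vector to a scalar value estimate.
state_dim = 8  # arbitrary choice for illustration
v_hat = nn.Sequential(
    nn.Linear(state_dim, 64),
    nn.Tanh(),
    nn.Linear(64, 1),
)

s = torch.randn(state_dim)     # a made-up state vector
value_estimate = v_hat(s)      # differentiable with respect to the network weights w
```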
Linear Feature Representations
Some thoughts afterwards on connections: this is actually the same thing as the gradient descent case below, except that for linear features the update simplifies to

$$\Delta w = \alpha\,\big(v_\pi(S) - \hat{v}(S, w)\big)\, x(S)$$
Let $x(S) = \big(x_1(S), \ldots, x_n(S)\big)^\top$ be the feature vector.

Represent the value function by a linear combination of features:

$$\hat{v}(S, w) = x(S)^\top w = \sum_{j=1}^{n} x_j(S)\, w_j$$

The objective function is quadratic in the parameters $w$:

$$J(w) = \mathbb{E}_{\pi}\big[(v_\pi(S) - x(S)^\top w)^2\big]$$

Stochastic gradient descent converges on the global optimum, and the update rule is particularly simple: since $\nabla_w \hat{v}(S, w) = x(S)$,

$$\Delta w = \alpha\,\big(v_\pi(S) - \hat{v}(S, w)\big)\, x(S)$$

i.e., update = step-size $\times$ prediction error $\times$ feature value.
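A minimal sketch of this update rule, assuming for the moment that an oracle supplies the target $v_\pi(S)$; the arguments `x_s` and `v_pi_s` are hypothetical inputs.

```python
import numpy as np

def sgd_update(w, x_s, v_pi_s, alpha=0.01):
    """One stochastic gradient step for linear value function approximation.

    w       : weight vector
    x_s     : feature vector x(S) of the sampled state
    v_pi_s  : the (oracle) true value v_pi(S) -- not available in practice
    alpha   : step size

    Update = step-size * prediction error * feature value.
    """
    v_hat = x_s @ w                          # v_hat(S, w) = x(S)^T w
    return w + alpha * (v_pi_s - v_hat) * x_s
```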
Notice in the above how all of this works assuming that we have the true value function $v_\pi(s)$, this "oracle" as David Silver calls it, which tells us how wrong our estimate $\hat{v}(s, w)$ is. However, in RL there is no supervisor, only rewards. We don't have $v_\pi(s)$. In practice, we substitute a target for $v_\pi(S_t)$ (a TD(0) sketch follows this list):
- In MC, we replace $v_\pi(S_t)$ with the return $G_t$
- In TD(0), we replace $v_\pi(S_t)$ with the TD target $R_{t+1} + \gamma\, \hat{v}(S_{t+1}, w)$
- In TD($\lambda$), we replace $v_\pi(S_t)$ with the $\lambda$-return $G_t^\lambda$
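For concreteness, a minimal sketch of the TD(0) substitution with linear features; the `env`, `policy`, and `features` interfaces are hypothetical stand-ins, not a specific library.

```python
import numpy as np

def semi_gradient_td0(env, policy, features, n_features,
                      alpha=0.01, gamma=0.99, episodes=1000):
    """Semi-gradient TD(0) with linear VFA: the oracle v_pi(S) is replaced
    by the TD target R + gamma * v_hat(S', w).

    Hypothetical interfaces:
      env.reset() -> state, env.step(a) -> (next_state, reward, done)
      policy(state) -> action
      features(state) -> feature vector x(S) of length n_features
    """
    w = np.zeros(n_features)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            target = r if done else r + gamma * features(s_next) @ w
            td_error = target - features(s) @ w
            w += alpha * td_error * features(s)   # gradient of v_hat is x(S)
            s = s_next
    return w
```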
Approximation by Gradient Descent
Goal: find the parameter vector $w$ minimizing the mean-squared error between the approximate value function $\hat{v}(s, w)$ and the true value function $v_\pi(s)$.
We define the loss as

$$J(w) = \mathbb{E}_{\pi}\big[(v_\pi(S) - \hat{v}(S, w))^2\big]$$

Gradient descent finds a local minimum. After applying the chain rule, we get

$$\Delta w = -\tfrac{1}{2}\,\alpha\, \nabla_w J(w) = \alpha\, \mathbb{E}_{\pi}\big[(v_\pi(S) - \hat{v}(S, w))\, \nabla_w \hat{v}(S, w)\big]$$

Instead of doing full gradient descent, we can sample the gradient with stochastic gradient descent:

$$\Delta w = \alpha\, \big(v_\pi(S) - \hat{v}(S, w)\big)\, \nabla_w \hat{v}(S, w)$$

The expected update is equal to the full gradient update.
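Spelled out, taking the expectation of the stochastic update recovers the (scaled) full-gradient step:

$$\mathbb{E}_{\pi}\big[\Delta w_{\text{SGD}}\big] = \alpha\, \mathbb{E}_{\pi}\big[(v_\pi(S) - \hat{v}(S, w))\, \nabla_w \hat{v}(S, w)\big] = -\tfrac{1}{2}\,\alpha\, \nabla_w J(w)$$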
TD Learning with VFA
TD is biased. There are three forms of approximation at play:
- function approximation
- bootstrapping
- sampling
VFA for Control
- Policy evaluation: approximate policy evaluation, $\hat{q}(\cdot, \cdot, w) \approx q_\pi$
- Policy improvement: $\epsilon$-greedy policy improvement (a linear SARSA sketch follows)
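A minimal sketch of what this control loop could look like with linear features, i.e. semi-gradient SARSA with $\epsilon$-greedy improvement; the `env` and `features` interfaces are hypothetical stand-ins.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

def semi_gradient_sarsa(env, features, n_features, n_actions,
                        alpha=0.01, gamma=0.99, epsilon=0.1, episodes=1000):
    """Control with VFA: approximate policy evaluation of q_hat(s, a, w)
    interleaved with epsilon-greedy policy improvement (linear semi-gradient SARSA).

    Hypothetical interfaces:
      env.reset() -> state, env.step(a) -> (next_state, reward, done)
      features(state, action) -> feature vector of length n_features
    """
    w = np.zeros(n_features)
    q = lambda s, a: features(s, a) @ w
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy([q(s, b) for b in range(n_actions)], epsilon)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            if done:
                target = r
            else:
                a_next = epsilon_greedy([q(s_next, b) for b in range(n_actions)], epsilon)
                target = r + gamma * q(s_next, a_next)
            w += alpha * (target - q(s, a)) * features(s, a)
            if not done:
                s, a = s_next, a_next
    return w
```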