Mean Squared Bellman Error (MSBE)
Notice $Q_\theta$ vs. $Q_{\theta_{\text{target}}}$? This is really important:
- If you directly minimize MSBE with gradient descent, the target itself changes when you update your network, which makes learning unstable.
Why $Q_{\theta_{\text{target}}}$, why not just use $Q_\theta$ as the target? That is what we do in the tabular Q-Learning case, because every time we update the Q-table, only one entry changes.
HOWEVER, in the continuous case, when we do a gradient update on $\theta$, any parameter change can drastically alter the entire function landscape. If you set the target with the same $Q_\theta$ and do a gradient update, the target shifts as fast as your weights do. This feedback loop causes divergence and instability.
- Taken from Deep Q-Learning note
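The tabular-vs-continuous contrast above can be sketched numerically. This is a hypothetical toy setup (the state/feature sizes and learning rate are made up for illustration): a tabular TD update touches exactly one Q-table entry, while one gradient step on a linear function approximator with shared weights shifts Q at every state.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Tabular: only the updated (s, a) entry changes ---
Q_table = np.zeros((5, 2))                    # 5 states, 2 actions
# TD update toward a target of 1.0 on the single entry (s=3, a=1)
Q_table[3, 1] += 0.1 * (1.0 - Q_table[3, 1])
n_changed_tabular = np.count_nonzero(Q_table)  # exactly 1 entry changed

# --- Linear approximation: shared weights move every state's value ---
phi = rng.normal(size=(5, 4))                 # feature vector per state
w = np.zeros(4)                               # shared parameters
q_before = phi @ w

# one semi-gradient TD step using only state 3's error...
td_error = 1.0 - q_before[3]
w += 0.1 * td_error * phi[3]

q_after = phi @ w
# ...yet Q moves at (almost) all 5 states, not just state 3,
# because every state's value is computed from the same weights
n_changed_linear = np.count_nonzero(np.abs(q_after - q_before) > 1e-12)
print(n_changed_tabular, n_changed_linear)
```

This is exactly why a target built from $Q_\theta$ itself is unstable under function approximation: the same weight update that improves one state's estimate also drags the targets for every other state along with it.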
Deep Q-networks minimize the MSBE so that gradients can flow through $Q_\theta$, as opposed to applying the Bellman optimality update directly:

$$\mathcal{L}(\theta) = \mathbb{E}_{(s,a,r,s')}\!\left[\left(Q_\theta(s,a) - \left(r + \gamma \max_{a'} Q_{\theta_{\text{target}}}(s',a')\right)\right)^2\right]$$
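A minimal sketch of this loss, assuming a linear $Q$ and a single made-up transition (all feature vectors, the reward, and the learning rate here are hypothetical). The key point is that the Bellman target is built from the frozen $\theta_{\text{target}}$ and treated as a constant, so the (semi-)gradient only flows through $Q_\theta(s,a)$:

```python
import numpy as np

rng = np.random.default_rng(1)
n_features, gamma = 4, 0.99

theta = rng.normal(size=n_features)    # online parameters
theta_target = theta.copy()            # frozen copy, synced only periodically

def q(params, phi):
    """Linear action-value: Q(s, a) = params^T phi(s, a)."""
    return params @ phi

# one (s, a, r, s') transition; two candidate next actions a'
phi_sa = rng.normal(size=n_features)
phi_next = rng.normal(size=(2, n_features))
r = 1.0

# Bellman target uses the FROZEN parameters: r + gamma * max_a' Q_target(s', a')
target = r + gamma * max(q(theta_target, p) for p in phi_next)

# squared Bellman error and its semi-gradient (target held constant)
delta = q(theta, phi_sa) - target
loss = delta ** 2
grad = 2 * delta * phi_sa              # no gradient flows through `target`

theta -= 0.1 * grad                    # one SGD step on theta only

# the target did not move, because theta_target was never updated
target_after = r + gamma * max(q(theta_target, p) for p in phi_next)
```

In an autodiff framework the same effect comes from detaching the target from the graph (e.g. `.detach()` in PyTorch) before computing the squared error, which is what makes this a semi-gradient rather than a full gradient of the MSBE.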