Mean Squared Bellman Error (MSBE)
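For reference, the MSBE discussed here is usually written as follows (a standard formulation; \theta are the online network's parameters and \theta_{target} the frozen target network's):

```latex
L(\theta) = \mathbb{E}_{(s,a,r,s')}\Big[\big(Q_\theta(s,a) - \underbrace{\big(r + \gamma \max_{a'} Q_{\theta_{target}}(s',a')\big)}_{\text{TD target}}\big)^2\Big]
```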

Notice Q_{\theta_{target}} vs. Q_\theta? This is really important:

  • If you directly minimize MSBE with gradient descent, the target itself changes when you update your network, which makes learning unstable.

Why Q_{\theta_{target}}? Why not just use Q_\theta as the target?

That is exactly what we do in the tabular Q-learning case: each update changes only one entry of the Q-table, so the targets for every other entry stay put.
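To make the "only one entry changes" point concrete, here is a minimal sketch of a single tabular Q-learning update on a hypothetical 3-state, 2-action MDP (all numbers are illustrative):

```python
import numpy as np

n_states, n_actions = 3, 2
alpha, gamma = 0.5, 0.9

Q = np.zeros((n_states, n_actions))

# One observed transition: (s, a, r, s')
s, a, r, s_next = 0, 1, 1.0, 2

# Tabular Q-learning update: only the single entry Q[s, a] moves.
td_target = r + gamma * Q[s_next].max()
Q[s, a] += alpha * (td_target - Q[s, a])

print(Q[s, a])              # → 0.5
print(np.count_nonzero(Q))  # → 1: exactly one entry changed
```

Because every other entry of Q is untouched, the targets built from those entries are stable between updates.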

HOWEVER, in the function-approximation case, a single gradient update on \theta can drastically alter the entire landscape of Q_\theta. If you compute the target with that same Q_\theta and then take a gradient step, the target shifts as fast as your weights do: you are chasing a moving regression target. This feedback loop causes divergence and instability.
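A minimal numpy sketch of the fix: compute the TD target with a frozen copy of the parameters, take a semi-gradient step on the online parameters, and only periodically sync the copy. The linear Q-function and all names here are illustrative assumptions, not a reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative linear Q-function: Q_theta(s, a) = theta[a] @ phi(s).
n_features, n_actions = 4, 2
alpha, gamma = 0.1, 0.99

theta = rng.normal(size=(n_actions, n_features))
theta_target = theta.copy()  # frozen copy, used ONLY to build targets

def q_values(params, phi):
    return params @ phi

def td_update(theta, theta_target, phi, a, r, phi_next):
    # Target built from the FROZEN parameters: it does not move
    # when theta is updated, which stabilizes the regression.
    target = r + gamma * q_values(theta_target, phi_next).max()
    td_error = target - q_values(theta, phi)[a]
    # Semi-gradient step on the squared error for this transition.
    theta[a] += alpha * td_error * phi
    return td_error

phi, phi_next = rng.normal(size=n_features), rng.normal(size=n_features)
td_update(theta, theta_target, phi, a=0, r=1.0, phi_next=phi_next)

# Every K updates (K is a tuning choice), sync the target network:
theta_target = theta.copy()
```

Note that only theta receives gradient updates; theta_target changes solely through the periodic copy, so between syncs the regression target is a fixed function.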

Deep Q-learning methods minimize the MSBE so that gradients can properly flow through the network, as opposed to directly applying the tabular Bellman optimality update:
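The tabular Bellman optimality update referred to above is the standard one, which overwrites a single table entry rather than taking a gradient step:

```latex
Q(s,a) \leftarrow Q(s,a) + \alpha\Big(r + \gamma \max_{a'} Q(s',a') - Q(s,a)\Big)
```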