Mean Squared Bellman Error (MSBE)
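For reference, the MSBE discussed here is usually written as follows (a standard formulation; \theta are the online network's parameters and \theta_{target} the frozen target network's):

```latex
L(\theta) = \mathbb{E}_{(s,a,r,s')}\Big[\big(Q_\theta(s,a) - \underbrace{\big(r + \gamma \max_{a'} Q_{\theta_{target}}(s',a')\big)}_{\text{TD target}}\big)^2\Big]
```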

Notice Q_{\theta_{target}} vs. Q_\theta? This is really important:

  • If you directly minimize MSBE with gradient descent, the target itself changes when you update your network, which makes learning unstable.

Why Q_{\theta_{target}}? Why not just use Q_\theta as the target?

That is exactly what we do in the tabular Q-learning case: each update changes only one entry of the Q-table, so the targets for every other entry stay put.
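To make the "only one entry changes" point concrete, here is a minimal sketch of a single tabular Q-learning update on a hypothetical 3-state, 2-action MDP (all numbers are illustrative):

```python
import numpy as np

n_states, n_actions = 3, 2
alpha, gamma = 0.5, 0.9

Q = np.zeros((n_states, n_actions))

# One observed transition: (s, a, r, s')
s, a, r, s_next = 0, 1, 1.0, 2

# Tabular Q-learning update: only the single entry Q[s, a] moves.
td_target = r + gamma * Q[s_next].max()
Q[s, a] += alpha * (td_target - Q[s, a])

print(Q[s, a])              # → 0.5
print(np.count_nonzero(Q))  # → 1: exactly one entry changed
```

Because every other entry of Q is untouched, the targets built from those entries are stable between updates.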

HOWEVER, in the function-approximation case, a single gradient update on \theta can drastically alter the entire landscape of Q_\theta. If you compute the target with that same Q_\theta and then take a gradient step, the target shifts as fast as your weights do: you are chasing a moving regression target. This feedback loop causes divergence and instability.
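A minimal numpy sketch of the fix: compute the TD target with a frozen copy of the parameters, take a semi-gradient step on the online parameters, and only periodically sync the copy. The linear Q-function and all names here are illustrative assumptions, not a reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative linear Q-function: Q_theta(s, a) = theta[a] @ phi(s).
n_features, n_actions = 4, 2
alpha, gamma = 0.1, 0.99

theta = rng.normal(size=(n_actions, n_features))
theta_target = theta.copy()  # frozen copy, used ONLY to build targets

def q_values(params, phi):
    return params @ phi

def td_update(theta, theta_target, phi, a, r, phi_next):
    # Target built from the FROZEN parameters: it does not move
    # when theta is updated, which stabilizes the regression.
    target = r + gamma * q_values(theta_target, phi_next).max()
    td_error = target - q_values(theta, phi)[a]
    # Semi-gradient step on the squared error for this transition.
    theta[a] += alpha * td_error * phi
    return td_error

phi, phi_next = rng.normal(size=n_features), rng.normal(size=n_features)
td_update(theta, theta_target, phi, a=0, r=1.0, phi_next=phi_next)

# Every K updates (K is a tuning choice), sync the target network:
theta_target = theta.copy()
```

Note that only theta receives gradient updates; theta_target changes solely through the periodic copy, so between syncs the regression target is a fixed function.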

Deep Q-learning methods minimize the MSBE so that gradients can properly flow through the network, as opposed to directly applying the tabular Bellman optimality update:
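The tabular Bellman optimality update referred to above is the standard one, which overwrites a single table entry rather than taking a gradient step:

```latex
Q(s,a) \leftarrow Q(s,a) + \alpha\Big(r + \gamma \max_{a'} Q(s',a') - Q(s,a)\Big)
```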