Maximization Bias

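The idea is that even when each individual estimate $Q(s, a)$ of a state-action value is unbiased, the maximum over those estimates, which is what gets used as the estimate of the state's value $V(s)$ under the greedy target policy $π$, can be biased upward.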
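A short way to see why, assuming the estimates $Q(s, a)$ are random variables with $\mathbb{E}[Q(s, a)] = q(s, a)$ for every action (the expectation notation here is my own addition, not from the source): the max is a convex function, so taking it before averaging can only increase the result.

$$
\mathbb{E}\left[\max_a Q(s, a)\right] \;\ge\; \max_a \mathbb{E}\left[Q(s, a)\right] \;=\; \max_a q(s, a)
$$

The gap between the left and right sides is the bias. It is typically strictly positive when several actions have noisy estimates and similar true values.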

See video

“All the control algorithms that we have discussed so far involve maximization in the construction of their target policies. For example, in Q-Learning the target policy is the greedy policy given the current action values, which is defined with a max, and in Sarsa the policy is often $ϵ$-greedy, which also involves a maximization operation. In these algorithms, a maximum over estimated values is used implicitly as an estimate of the maximum value, which can lead to a significant positive bias.

Example

Consider a single state $s$ where there are many actions $a$ whose true values, $q(s, a)$, are all zero but whose estimated values, $Q(s, a)$, are uncertain and thus distributed some above and some below zero. The maximum of the true values is zero, but the maximum of the estimates is positive, a positive bias. We call this maximization bias.”
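The single-state example can be checked numerically. This is a minimal sketch, not anything from the source: it assumes each estimate $Q(s, a)$ is one independent, zero-mean Gaussian sample (so every estimate is unbiased and every true value is 0), and the function name `mean_max_estimate` and its parameters are hypothetical.

```python
import random
import statistics


def mean_max_estimate(num_actions, num_trials=5000, noise_std=1.0, seed=0):
    """Average of max_a Q(s, a) when every true value q(s, a) is 0.

    Assumption: each Q(s, a) is an independent draw from N(0, noise_std^2),
    i.e. an unbiased but noisy estimate of the true value 0.
    """
    rng = random.Random(seed)  # local generator so results are repeatable
    maxima = []
    for _ in range(num_trials):
        # One set of noisy estimates for the single state s.
        estimates = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        # The greedy target uses the max of the estimates as the state's value.
        maxima.append(max(estimates))
    return statistics.fmean(maxima)


if __name__ == "__main__":
    for n in (1, 2, 10, 100):
        print(f"{n:>3} actions: mean of max Q = {mean_max_estimate(n):+.3f}")
```

With a single action the average stays near zero, since there is nothing to maximize over. As the number of actions grows, the average of the max moves further above the true maximum of zero, even though no individual estimate is biased.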

Double Q-Learning

🛠️ Steven Gong
