Maximization Bias

The idea is that even when each individual estimate of a state-action value is unbiased, the estimate of the state's value obtained by taking the maximum over those estimates can be biased upward.
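
In symbols, the claim can be sketched as the inequality below. The notation is an assumption made for illustration: Q(s, a) denotes a noisy estimate whose expectation equals the true value q(s, a), and the expectation is taken over the noise in the estimates.

```latex
\mathbb{E}\Big[\max_a Q(s,a)\Big] \;\ge\; \max_a \mathbb{E}\big[Q(s,a)\big] \;=\; \max_a q(s,a)
```

The inequality holds because max_a Q(s, a) is at least Q(s, a') for every fixed action a', so its expectation is at least each q(s, a'). It is typically strict when several actions have similar true values and their estimates are noisy.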

“All the control algorithms that we have discussed so far involve maximization in the construction of their target policies. For example, in Q-learning the target policy is the greedy policy given the current action values, which is defined with a max, and in Sarsa the policy is often ε-greedy, which also involves a maximization operation. In these algorithms, a maximum over estimated values is used implicitly as an estimate of the maximum value, which can lead to a significant positive bias.

Example

Consider a single state s where there are many actions a whose true values, q(s, a), are all zero but whose estimated values, Q(s, a), are uncertain and thus distributed some above and some below zero. The maximum of the true values is zero, but the maximum of the estimates is positive, a positive bias. We call this maximization bias.”
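
The single-state example can be checked numerically. The following is a minimal sketch, not taken from the quoted text: it assumes every action's reward is zero-mean Gaussian noise with unit variance, and that each estimate is the sample mean of a few such rewards. The function names and default parameters are hypothetical choices made for illustration.

```python
import random
import statistics


def max_of_estimates(num_actions, samples_per_action, rng):
    """Return the max over actions of sample-mean estimates.

    Every action's true value is 0 (assumption: rewards ~ N(0, 1)).
    """
    estimates = []
    for _ in range(num_actions):
        # Zero-mean noise, so each sample mean is an unbiased estimate of 0.
        rewards = [rng.gauss(0.0, 1.0) for _ in range(samples_per_action)]
        estimates.append(statistics.fmean(rewards))
    return max(estimates)


def average_max_estimate(num_actions=10, samples_per_action=5,
                         trials=2000, seed=0):
    """Average the max-of-estimates over many independent trials."""
    rng = random.Random(seed)
    return statistics.fmean(
        max_of_estimates(num_actions, samples_per_action, rng)
        for _ in range(trials)
    )


if __name__ == "__main__":
    # With one action the max is just that action's estimate: no bias expected.
    print("1 action:  ", average_max_estimate(num_actions=1))
    # With many actions the max tends to pick whichever estimate got lucky.
    print("10 actions:", average_max_estimate(num_actions=10))
```

With a single action the average stays close to the true value of zero, while with ten actions the average of the maximum is clearly positive even though no individual estimate is biased.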
