
\begin{abstract}

The Q-learning algorithm is known to be affected by the \emph{maximization bias}, i.e.\ the systematic overestimation of action values, an important issue that has recently received renewed attention. Double Q-learning has been proposed as an efficient algorithm to mitigate this bias. However, this comes at the price of an \emph{underestimation} of action values, in addition to increased memory requirements and slower convergence.

In this paper, we introduce a new way to address the maximization bias in the form of a ``self-correcting algorithm'' for approximating the maximum of an expected value. Our method balances the overestimation of the single estimator used in conventional Q-learning and the underestimation of the double estimator used in Double Q-learning. Applying this strategy to Q-learning results in \emph{Self-correcting Q-learning}. We show theoretically that this new algorithm enjoys the same convergence guarantees as Q-learning while being more accurate. Empirically, it performs better than Double Q-learning in domains with high-variance rewards, and it even attains faster convergence than Q-learning in domains with zero- or low-variance rewards. These advantages transfer to a Deep Q Network implementation that we call \emph{Self-correcting DQN}, which outperforms regular DQN and Double DQN on several tasks in the Atari 2600 domain.

\end{abstract}
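
As an illustrative sketch of the two biases mentioned in the abstract (the notation here is assumed for illustration and is not taken from the paper), let $\hat\mu_1,\dots,\hat\mu_M$ be unbiased estimators of the means $\mu_1,\dots,\mu_M$ of $M$ random variables. Since the maximum is a convex function, Jensen's inequality shows that the single estimator tends to overestimate:
\[
\mathrm{E}\bigl[\max_i \hat\mu_i\bigr] \;\ge\; \max_i \mathrm{E}[\hat\mu_i] \;=\; \max_i \mu_i .
\]
The double estimator instead uses two independent sample sets $A$ and $B$: it selects $a^* = \arg\max_i \hat\mu_i^A$ and evaluates $\hat\mu^B_{a^*}$. Because $a^*$ is independent of the estimates computed from $B$,
\[
\mathrm{E}\bigl[\hat\mu^B_{a^*}\bigr] \;=\; \sum_{i=1}^{M} P(a^* = i)\,\mu_i \;\le\; \max_i \mu_i ,
\]
so it tends to underestimate. A self-correcting estimator in the sense of the abstract would aim for a value between these two extremes.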
