\begin{abstract}
The optimistic nature of the $Q$-learning target leads to an overestimation
bias, which is an inherent problem associated with standard $Q$-learning. Such a bias fails to account for the possibility of low returns, particularly in risky scenarios.
However, the existence of biases, whether overestimation or underestimation,
is not necessarily undesirable. In this paper, we analytically
examine the utility of biased learning, and show that specific types
of biases may be preferable, depending on the scenario. Based on this
finding, we design a novel reinforcement learning algorithm, \emph{Balanced Q-learning},
in which the target is modified to be a convex combination of a pessimistic
and an optimistic term, whose weights are determined analytically
online. We prove the convergence of this algorithm in a tabular
setting, and empirically demonstrate its superior learning performance
in various environments.%, where it either matches or outperforms other competing approaches.
\end{abstract}
