
\begin{abstract}
The optimistic nature of the $Q$-learning target leads to an overestimation
bias, which is an inherent problem associated with standard $Q$-learning. Such a bias fails to account for the possibility of low returns, particularly in risky scenarios.
However, the existence of biases, whether overestimation or underestimation,
need not be undesirable. In this paper, we analytically
examine the utility of biased learning, and show that specific types
of biases may be preferable, depending on the scenario. Based on this
finding, we design a novel reinforcement learning algorithm, \emph{Balanced Q-learning},
in which the target is modified to be a convex combination of a pessimistic
and an optimistic term, whose weights are determined analytically
online. We prove the convergence of this algorithm in a tabular
setting, and empirically demonstrate its superior learning performance
in various environments.%, where it either matches or outperforms other competing approaches.
\end{abstract}
