\begin{abstract}%
%  Recently, deep Q-learning has captured significant attention in the reinforcement learning (RL) community for outperforming humans in several challenging tasks. Its key component is the use of target networks, which is known to stabilize the algorithm.
 Target networks are commonly used in deep reinforcement learning to stabilize training; however, the theoretical understanding of this technique is still limited. In this paper, we study the \emph{periodic Q-learning} algorithm (PQ-learning for short), a tabular algorithm for solving infinite-horizon discounted Markov decision processes (DMDPs) that resembles the target-network technique used in deep Q-learning. PQ-learning maintains two separate Q-value estimates: the online estimate and the target estimate. The online estimate follows the standard Q-learning update, while the target estimate is updated \emph{periodically}. In contrast to standard Q-learning, PQ-learning admits a simple finite-time analysis and achieves better sample complexity for finding an $\varepsilon$-optimal policy. Our result provides a preliminary justification for the effectiveness of target estimates, or target networks, in Q-learning algorithms.
%  The main advantage of this algorithm is that its convergence analysis can be easily done by applying standard tools in convex and stochastic optimization areas.
%  From this aspect, its finite-time convergence and complexity analysis are easily provided. To the authors' knowledge, finite-time convergence analysis of the standard Q-learning for DMDPs is much more challenging, and only a few results are reported in the literature.

%  Moreover, various improvements are possible through recent advances in stochastic optimization such as the variance reduction and acceleration techniques.
\end{abstract}
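
As a rough illustration of the two-estimate scheme described in the abstract, the updates can be sketched as follows. The notation is an assumption introduced here and is not taken from the abstract: $Q_t$ denotes the online estimate, $\hat{Q}$ the target estimate, $\alpha_t$ a step size, $\gamma \in (0,1)$ the discount factor, $N$ the update period, and $(s_t, a_t, r_t, s_t')$ a sampled transition. The sketch further assumes that the online update bootstraps from the target estimate, as target networks do in deep Q-learning; the abstract itself does not state this detail.
\[
Q_{t+1}(s_t,a_t) = Q_t(s_t,a_t) + \alpha_t \Bigl( r_t + \gamma \max_{a'} \hat{Q}(s_t',a') - Q_t(s_t,a_t) \Bigr),
\]
\[
\hat{Q} \leftarrow Q_{t+1} \quad \mbox{whenever $t+1$ is a multiple of $N$}.
\]
Under these assumptions, $\hat{Q}$ is held fixed between synchronizations, so each block of $N$ online steps works against a stationary target.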