abstract:542f6e4645e44e4f.tex

1: \begin{abstract}

2: The use of target networks has been a popular and key component of recent deep Q-learning algorithms for reinforcement learning, yet little is known from the theory side. In this work, we introduce a new family of target-based temporal difference (TD) learning algorithms and provide theoretical analysis on their convergences. In contrast to the standard TD-learning, target-based TD algorithms maintain two separate learning parameters--the target variable and online variable. Particularly, we introduce three members in the family, called the {\em averaging TD, double TD}, and {\em periodic TD}, where the target variable is updated through an averaging, symmetric, or periodic fashion, mirroring those techniques used in deep Q-learning practice.

3:

4: We establish asymptotic convergence analyses for both {\em averaging TD} and {\em double TD} and a finite sample analysis for {\em periodic TD}. In addition, we also provide some simulation results showing potentially superior convergence of these target-based TD algorithms compared to the standard TD-learning. While this work focuses on linear function approximation and policy evaluation setting, we consider this as a meaningful step towards the theoretical understanding of deep Q-learning variants with target networks.

5: \end{abstract}

6: