\begin{abstract}
The deadly triad refers to the instability of a reinforcement learning algorithm when it employs off-policy learning,
function approximation,
and bootstrapping simultaneously.
In this paper,
we investigate the target network as a tool for breaking the deadly triad,
providing theoretical support for the conventional wisdom that a target network stabilizes training.
We first propose and analyze a novel target network update rule that augments the commonly used Polyak-averaging style update with two projections.
We then apply the target network and ridge regularization to several divergent algorithms and show their convergence to regularized TD fixed points.
These algorithms
are off-policy with linear function approximation and bootstrapping,
spanning both policy evaluation and control, as well as
both discounted and average-reward settings.
In particular,
we provide the first convergent linear $Q$-learning algorithms under nonrestrictive and changing behavior policies without bi-level optimization.
\end{abstract}
