
\begin{abstract}
The deadly triad refers to the instability of a reinforcement learning algorithm when it employs off-policy learning,
function approximation,
and bootstrapping simultaneously.
In this paper,
we investigate the target network as a tool for breaking the deadly triad,
providing theoretical support for the conventional wisdom that a target network stabilizes training.
We first propose and analyze a novel target network update rule which augments the commonly used Polyak-averaging style update with two projections.
We then apply the target network and ridge regularization in several divergent algorithms and show their convergence to regularized TD fixed points.
Those algorithms
are off-policy with linear function approximation and bootstrapping,
spanning both policy evaluation and control, as well as
both discounted and average-reward settings.
In particular,
we provide the first convergent linear $Q$-learning algorithms under nonrestrictive and changing behavior policies without bi-level optimization.
\end{abstract}
