
\begin{abstract}%   <- trailing '%' for backward compatibility of .sty file
Off-policy learning enables a reinforcement learning (RL) agent to reason counterfactually about policies that are not executed
and is one of the most important ideas in RL.
It can, however,
lead to instability when combined with function approximation and bootstrapping,
two arguably indispensable ingredients for large-scale reinforcement learning.
This is the notorious deadly triad.
Gradient Temporal Difference (GTD) learning is one powerful tool for addressing the deadly triad.
Its success results from solving a double sampling issue indirectly with weight duplication or Fenchel duality.
In this paper,
we instead propose a direct method that solves the double sampling issue by simply using two samples in a Markovian data stream separated by an increasing gap.
The resulting algorithm is as computationally efficient as GTD but gets rid of GTD's extra weights.
The only price we pay is a memory requirement that grows logarithmically as time progresses.
We provide both asymptotic and finite-sample analysis,
where the convergence rate is on par with that of canonical on-policy temporal difference learning.
Key to our analysis is a novel refined discretization of limiting ODEs.
\end{abstract}
