\begin{abstract}%   <- trailing '%' for backward compatibility of .sty file
Off-policy learning enables a reinforcement learning (RL) agent to reason counterfactually about policies that are not executed
and is one of the most important ideas in RL.
However, it
can lead to instability when combined with function approximation and bootstrapping,
two arguably indispensable ingredients for large-scale reinforcement learning.
This is the notorious deadly triad.
Gradient Temporal Difference (GTD) learning is a powerful tool for addressing the deadly triad.
Its success results from solving a double sampling issue indirectly with weight duplication or Fenchel duality.
In this paper,
we instead propose a direct method to solve the double sampling issue by simply using two samples in a Markovian data stream with an increasing gap.
The resulting algorithm is as computationally efficient as GTD but gets rid of GTD's extra weights.
The only price we pay is a memory requirement that grows logarithmically as time progresses.
We provide both asymptotic and finite-sample analyses,
where the convergence rate is on par with that of canonical on-policy temporal difference learning.
Key to our analysis is a novel refined discretization of limiting ODEs.
\end{abstract}