\begin{abstract}
Reinforcement learning lies at the intersection of several challenges.
Many applications of interest involve extremely large state spaces, requiring \textit{function approximation} to enable tractable computation.
In addition, the learner has only a single stream of experience with which to evaluate a large number of possible courses of action, necessitating
algorithms that can learn \textit{off-policy}.
However, the combination of off-policy learning with function approximation
can cause temporal difference methods to diverge.
Recent work on gradient-based temporal difference methods has promised a path to stability, but at the cost of expensive hyperparameter tuning.
In parallel, progress in online learning has provided parameter-free methods that achieve minimax-optimal guarantees up to logarithmic terms, but their application in reinforcement learning has yet to be explored.
In this work, we combine these two lines of attack, deriving parameter-free, gradient-based temporal difference algorithms.
Our algorithms run in linear time and achieve high-probability convergence guarantees matching those of GTD2 up to logarithmic factors.
Our experiments demonstrate that our methods maintain high prediction performance relative to fully tuned baselines, with no tuning whatsoever.
\end{abstract}