\begin{abstract}
  Reinforcement learning lies at the intersection of several challenges.
  Many applications of interest involve extremely large state spaces, requiring \textit{function approximation} to enable tractable computation.
  In addition, the learner has only a single stream of experience with which to evaluate a large number of possible courses of action, necessitating
  algorithms that can learn \textit{off-policy}.
  However, the combination of off-policy learning with function approximation
  can cause temporal difference methods to diverge.
  Recent work on gradient-based temporal difference methods has promised a path to stability, but at the cost of expensive hyperparameter tuning.
  In parallel, progress in online learning has provided parameter-free methods that achieve minimax optimal guarantees up to logarithmic terms, but their application in reinforcement learning has yet to be explored.
  In this work, we combine these two lines of attack, deriving parameter-free, gradient-based temporal difference algorithms. Our algorithms run in linear time and achieve high-probability convergence guarantees matching those of GTD2 up to logarithmic factors.
  Our experiments demonstrate that our methods maintain high prediction performance relative to fully tuned baselines, with no tuning whatsoever.
\end{abstract}