\begin{abstract}
  Reinforcement learning lies at the intersection of several challenges.
  Many applications of interest involve extremely large state spaces, requiring \textit{function approximation} to enable tractable computation.
  In addition, the learner has only a single stream of experience with which to evaluate a large number of possible courses of action, necessitating
  algorithms that can learn \textit{off-policy}.
  However, the combination of off-policy learning with function approximation
  can lead to divergence of temporal difference methods.
  Recent work on gradient-based temporal difference methods has promised a path to stability, but at the cost of expensive hyperparameter tuning.
  In parallel, progress in online learning has provided parameter-free methods that achieve minimax optimal guarantees up to logarithmic terms, but their application in reinforcement learning has yet to be explored.
  In this work, we combine these two lines of attack, deriving parameter-free, gradient-based temporal difference algorithms. Our algorithms run in linear time and achieve high-probability convergence guarantees matching those of GTD2 up to logarithmic factors.
  Our experiments demonstrate that our methods maintain high prediction performance relative to fully tuned baselines, with no tuning whatsoever.
\end{abstract}