\begin{abstract}
  Various algorithms in reinforcement learning exhibit dramatic
  variability in their convergence rates and ultimate accuracy as a
  function of the problem structure.  Such instance-specific behavior
  is not captured by existing global minimax bounds, which are
  worst-case in nature.  We analyze the problem of estimating optimal
  $Q$-value functions for a discounted Markov decision process with
  discrete states and actions and identify an instance-dependent
  functional that controls the difficulty of estimation in the
  $\ell_\infty$-norm.  Using a local minimax framework, we show that
  this functional arises in lower bounds on the accuracy of any
  estimation procedure.  In the other direction, we establish the
  sharpness of our lower bounds, up to factors logarithmic in the
  sizes of the state and action spaces, by analyzing a variance-reduced
  version of $Q$-learning.  Our theory provides a precise way of
  distinguishing ``easy'' problems from ``hard'' ones in the context of
  $Q$-learning, as illustrated by an ensemble with a continuum of
  difficulty.
\end{abstract}