\begin{abstract}
  Various algorithms for reinforcement learning (RL) exhibit dramatic
  variation in their convergence rates as a function of problem
  structure. Such problem-dependent behavior is not captured by
  worst-case analyses and has accordingly inspired a growing effort to
  obtain instance-dependent guarantees and to derive
  instance-optimal algorithms for RL problems. This research has been
  carried out, however, primarily within the confines of theory,
  providing guarantees that explain \textit{ex post} the performance
  differences observed. A natural next step is to convert these
  theoretical guarantees into guidelines that are useful in
  practice. We address the problem of obtaining sharp
  instance-dependent confidence regions for the policy evaluation
  problem and the optimal value estimation problem of a Markov
  decision process (MDP), given access to an instance-optimal
  algorithm.  As a consequence, we
  propose a data-dependent stopping rule for instance-optimal
  algorithms.  The proposed stopping rule adapts to the
  instance-specific difficulty of the problem and allows for early
  termination for problems with favorable structure.
\end{abstract}
