\begin{abstract}
Doubly robust methods hold considerable
promise for off-policy evaluation in Markov decision processes (MDPs) under sequential ignorability:
They have been shown to converge as $1/\sqrt{T}$ with the horizon $T$, to be statistically
efficient in large samples, and to allow for modular implementation where preliminary estimation
tasks can be executed using standard reinforcement learning techniques. Existing results, however,
make heavy use of a strong distributional overlap assumption whereby the stationary distributions of
the target policy and the data-collection policy are within a bounded factor of each other---and
this assumption is typically only credible when the state space of the MDP is bounded. In this paper, we
revisit the task of off-policy evaluation in MDPs under a weaker notion of distributional overlap,
and introduce a class of truncated doubly robust (TDR) estimators which we find to perform well
in this setting. When the distribution ratio of the target
and data-collection policies is square-integrable (but not necessarily bounded), our approach
recovers the large-sample behavior previously established under strong distributional overlap.
When this ratio is not square-integrable, TDR is still consistent but converges at a
slower-than-$1/\sqrt{T}$ rate; furthermore, this rate of convergence is minimax over a class of
MDPs defined only using mixing conditions. We validate our approach numerically and find that,
in our experiments, appropriate truncation plays a major role in enabling accurate off-policy
evaluation when strong distributional overlap does not hold.
\end{abstract}