1: \begin{abstract}
2: Doubly robust methods hold considerable
3: promise for off-policy evaluation in Markov decision processes (MDPs) under sequential ignorability:
4: They have been shown to converge as $1/\sqrt{T}$ with the horizon $T$, to be statistically
5: efficient in large samples, and to allow for modular implementation where preliminary estimation
6: tasks can be executed using standard reinforcement learning techniques. Existing results, however,
7: make heavy use of a strong distributional overlap assumption whereby the stationary distributions of
8: the target policy and the data-collection policy are within a bounded factor of each other---and
9: this assumption is typically only credible when the state space of the MDP is bounded. In this paper, we
10: revisit the task of off-policy evaluation in MDPs under a weaker notion of distributional overlap,
11: and introduce a class of truncated doubly robust (TDR) estimators which we find to perform well
12: in this setting. When the distribution ratio of the target
13: and data-collection policies is square-integrable (but not necessarily bounded), our approach
14: recovers the large-sample behavior previously established under strong distributional overlap.
15: When this ratio is not square-integrable, TDR is still consistent but with a slower-than-$1/\sqrt{T}$ rate; furthermore, this rate of convergence is minimax over a class of
16: MDPs defined only using mixing conditions. We validate our approach numerically and find that,
17: in our experiments, appropriate truncation plays a major role in enabling accurate off-policy
18: evaluation when strong distributional overlap does not hold.
19: \end{abstract}
20: