
\begin{abstract}

The goal of \emph{off-policy evaluation} (OPE) is to evaluate a new policy using historical data obtained via a \emph{behavior policy}. However, because a contextual bandit algorithm updates its policy based on past observations, the samples are not \emph{independent and identically distributed} (i.i.d.). This paper tackles this problem by constructing an estimator from a \emph{martingale difference sequence} (MDS) for the dependent samples. For the data-generating process, we do not assume that the policy converges; instead, we assume that the policy uses the same conditional probability of choosing an action throughout a given period. Under this setting, we derive an asymptotically normal estimator of the value of an \emph{evaluation policy}. As an additional advantage, our batch-based approach simultaneously solves the deficient support problem. Using benchmark and real-world datasets, we experimentally confirm the effectiveness of the proposed method.

\end{abstract}
