
\begin{abstract}


When counterfactual risk minimization (CRM) fails, that is, when the logged data do not suffice to find an improved policy, it is natural to collect additional data. While this could be done with the original logging system, the data already collected can instead be used to design sequential, adaptive data collection. In this work, we extend the CRM principle to sequential counterfactual risk minimization (SCRM), which provides a framework for such sequential designs. For parametric policies, we first establish variance-dependent convergence guarantees for CRM, which can be leveraged to accelerate convergence in sequential designs under a H\"olderian error bound assumption. In particular, we show that, in the best cases, the $O(\sqrt{n})$ regret rate of CRM can be improved to $O(\log^2{n})$ with SCRM, where $n$ is the total sample size, using an algorithm akin to restart strategies in accelerated optimization methods. Finally, we provide an empirical evaluation of our method in both discrete and continuous action settings, which confirms the benefits of this essential yet understudied setting.


\end{abstract}
