
\begin{abstract}

Thompson sampling has been shown to be an effective policy across a variety of online learning tasks. Many works have analyzed the finite-time performance of Thompson sampling and proved that it achieves sub-linear regret under a broad range of probabilistic settings. However, its asymptotic behavior remains largely unexplored. In this paper, we prove an asymptotic convergence result for Thompson sampling under the assumption of sub-linear Bayesian regret, and show that the actions of a Thompson sampling agent provide a strongly consistent estimator of the optimal action. Our results rely on the martingale structure inherent in Thompson sampling.

\end{abstract}
