
\begin{abstract}

Monte-Carlo planning, as exemplified by Monte-Carlo Tree Search (MCTS), has demonstrated remarkable performance in applications with finite state-action spaces. In this paper, we consider Monte-Carlo planning in environments with continuous state-action spaces, a much less understood problem with important applications in control and robotics. We introduce \texttt{POLY-HOOT}, an algorithm that augments MCTS with a continuous-armed bandit strategy named Hierarchical Optimistic Optimization (HOO) \citep{bubeck2011x}. Specifically, we enhance HOO by using an appropriate \emph{polynomial}, rather than \emph{logarithmic}, bonus term in the upper confidence bounds. Such a polynomial bonus is motivated by its empirical success in AlphaGo Zero~\citep{silver2017mastering}, as well as its significant role in achieving theoretical guarantees for finite-space MCTS~\citep{shah2019reinforcement}. We investigate, for the first time, the regret of the enhanced HOO algorithm in non-stationary bandit problems. Using this result as a building block, we establish non-asymptotic convergence guarantees for \texttt{POLY-HOOT}: the value estimate converges to an arbitrarily small neighborhood of the optimal value function at a polynomial rate. We further provide experimental results that corroborate our theoretical findings.

\end{abstract}
