\begin{abstract}
We consider Monte-Carlo Tree Search (MCTS) applied to Markov Decision
Processes (MDPs) and Partially Observable MDPs (POMDPs), and the
well-known Upper Confidence bound for Trees (UCT) algorithm. In UCT, a
tree with nodes (states) and edges (actions) is built incrementally by
expanding nodes, and the values of nodes are updated through a
backup strategy based on the average value of child nodes. However, it
has been shown that, with enough samples, the maximum operator yields
more accurate node value estimates than averaging. Instead of settling
for one of these two value estimates, we go a step further and propose
a novel backup strategy based on the power mean operator, which
computes a value between the average and the maximum. We call our new
approach \alg, and argue that the power mean operator helps to speed up
learning in MCTS. We analyze our method theoretically and provide
guarantees of convergence to the optimum. Finally, we empirically
demonstrate the effectiveness of our method on well-known MDP and POMDP
benchmarks, showing significant improvements in performance and
convergence speed over state-of-the-art algorithms.
\end{abstract}
