abstract:b2741ea6bee33829.tex

1: \begin{abstract}

2:  Policy gradient (PG) is a reinforcement learning (RL) approach that optimizes a parameterized policy model for an expected return using gradient ascent.

3: %

4:  Given a well-parameterized policy model, such as a neural network model, with appropriate initial parameters,

5:  PG algorithms work well even when the environment does not have the Markov property.

6:  %

7:  Otherwise, they can be trapped on a plateau or suffer from peakiness effects.

8:  %

9:  As another successful RL approach, algorithms based on Monte-Carlo Tree Search (MCTS), including AlphaZero, have obtained groundbreaking results, especially in board games.

10:  %

11:  They are also applicable to non-Markov decision processes.

12:  %

13:  However, since standard MCTS cannot learn state representations,

14: the tree-search space can be too large to search.

15:  %they usually require an execution of MCTS for action selection even in a test phase. Thus, the scope of those applications is limited compared to the ordinary RL methods.

16:  %

17:  In this work, we examine a mixture policy of PG and MCTS in which each method compensates for the other's weaknesses

18: while retaining the strengths of both.

19:  %

20:  We derive conditions for asymptotic convergence using results from two-timescale stochastic approximation

21:  and propose an algorithm that satisfies these conditions.

22:  % We propose a few types of mixtures and show those convergence property.

23:  %

24:  The effectiveness of the proposed methods is verified through numerical experiments on non-Markov decision processes.

25: \end{abstract}

26: