1: \begin{abstract}
2: Policy gradient (PG) is a reinforcement learning (RL) approach that optimizes a parameterized policy model for an expected return using gradient ascent.
3: %
4: Given a well-parameterized policy model, such as a neural network model, with appropriate initial parameters,
5: PG algorithms work well even when the environment does not have the Markov property.
6: %
7: Otherwise, they can be trapped on a plateau or suffer from peakiness effects.
8: %
9: As another successful RL approach, algorithms based on Monte-Carlo Tree Search (MCTS), including AlphaZero, have achieved groundbreaking results, especially in board games.
10: %
11: They are also applicable to non-Markov decision processes.
12: %
13: However, since standard MCTS cannot learn state representations,
14: the tree-search space can be too large to search.
15: %they usually require an execution of MCTS for action selection even in a test phase. Thus, the scope of those applications is limited compared to the ordinary RL methods.
16: %
17: In this work, we examine a mixture policy of PG and MCTS so that each compensates for the other's weaknesses
18: and the strengths of both are exploited.
19: %
20: We derive conditions for asymptotic convergence using results from two-timescale stochastic approximation
21: and propose an algorithm that satisfies these conditions.
22: % We propose a few types of mixtures and show those convergence property.
23: %
24: The effectiveness of the proposed methods is verified through numerical experiments on non-Markov decision processes.
25: \end{abstract}
26: