b2741ea6bee33829.tex
1: \begin{abstract}
2:  Policy gradient (PG) is a reinforcement learning (RL) approach that optimizes a parameterized policy model to maximize the expected return via gradient ascent.
3: %
4:  Given a well-parameterized policy model, such as a neural network model, with appropriate initial parameters,
5:  PG algorithms work well even when the environment does not have the Markov property.
6:  %
7:  Otherwise, they can be trapped on a plateau or suffer from peakiness effects.
8:  %
9:  As another successful RL approach, algorithms based on Monte-Carlo Tree Search (MCTS), which include AlphaZero, have obtained groundbreaking results, especially in the board-game domain.
10:  %
11:  They are also applicable to non-Markov decision processes.
12:  %
13:  However, since standard MCTS cannot learn state representations,
14: the tree-search space can be too large to search.
15:  %they usually require an execution of MCTS for action selection even in a test phase. Thus, the scope of those applications is limited compared to the ordinary RL methods.
16:  %
17:  In this work, we examine a mixture policy of PG and MCTS so that each compensates for the other's weaknesses
18: while retaining the strengths of both.
19:  %
20:  We derive conditions for asymptotic convergence using results from two-timescale stochastic approximation
21:  and propose an algorithm that satisfies these conditions.
22:  % We propose a few types of mixtures and show those convergence property.
23:  %
24:  The effectiveness of the proposed methods is verified through numerical experiments on non-Markov decision processes.
25: \end{abstract}
26: