51878c15301a5e73.tex
1: \begin{abstract}
2:     We introduce and study a novel Policy Gradient (PG) algorithm, called \textit{Matryoshka Policy Gradient} (MPG), in the context of max-entropy reinforcement learning, where an agent aims to maximize entropy bonuses in addition to its cumulative rewards.
3:     MPG differs from standard PG in that it trains a sequence of policies to learn finite-horizon tasks simultaneously, rather than a single policy for the single standard objective.
4:     For softmax policies, we prove convergence of MPG and global optimality of the limit by showing that the only critical point of the MPG objective is the optimal policy; these results hold even for continuous compact state spaces.
5:     MPG is intuitive and theoretically sound; furthermore, we show that the optimal policy of the standard max-entropy objective can be approximated arbitrarily well by the optimal policy of the MPG framework.
6:     Finally, we argue that MPG is well suited to policies parametrized with neural networks, and we provide a simple criterion to verify the global optimality of the policy at convergence.
7:     As a proof of concept, we numerically evaluate MPG on standard benchmarks.
8: \end{abstract}
9: