\begin{abstract}

    We introduce and study a novel Policy Gradient (PG) algorithm, called \textit{Matryoshka Policy Gradient} (MPG), in the context of max-entropy reinforcement learning, where an agent aims to maximize entropy bonuses in addition to its cumulative rewards.

    MPG differs from standard PG in that it simultaneously trains a sequence of policies to learn finite-horizon tasks, instead of a single policy for the single standard objective.

    For softmax policies, we prove convergence of MPG and global optimality of the limit by showing that the only critical point of the MPG objective is the optimal policy; these results hold even for a continuous compact state space.

    MPG is intuitive and theoretically sound; furthermore, we show that the optimal policy of the standard max-entropy objective can be approximated arbitrarily well by the optimal policy of the MPG framework.

    Finally, we argue that MPG is well suited to policies parametrized by neural networks, and we provide a simple criterion to verify the global optimality of the policy at convergence.

    As a proof of concept, we numerically evaluate MPG on standard test benchmarks.

\end{abstract}
