1: \begin{abstract}
2:
3: Natural policy gradient (NPG) methods are among the most widely used policy optimization algorithms in contemporary reinforcement learning.
4: This class of methods is often applied in conjunction with entropy regularization (an algorithmic scheme that encourages exploration) and is closely related to soft policy iteration and trust region policy optimization. Despite their empirical success, the theoretical underpinnings of NPG methods remain limited even in the tabular setting.
5:
6:
7: This paper develops {\em non-asymptotic} convergence guarantees for entropy-regularized NPG methods under softmax parameterization, focusing on discounted Markov decision processes (MDPs). Assuming access to exact policy evaluation, we demonstrate that the algorithm converges linearly (and even quadratically once it enters a local region around the optimal policy) when computing optimal value functions of the regularized MDP.
8: % The iteration complexity depends only logarithmically on the problem dimension.
9: Moreover, the algorithm is provably stable vis-\`a-vis inexactness of policy evaluation. Our convergence results accommodate a wide range of learning rates and shed light on the role of entropy regularization in enabling fast convergence.
10:
11: %outperform the ones established for unregularized NPG methods \citep{agarwal2019optimality}, and
12:
13: \end{abstract}
14: