\begin{abstract}
Natural policy gradient (NPG) methods are among the most widely used policy optimization algorithms in contemporary reinforcement learning.
This class of methods is often applied in conjunction with entropy regularization, an algorithmic scheme that encourages exploration, and is closely related to soft policy iteration and trust region policy optimization. Despite their empirical success, the theoretical underpinnings of NPG methods remain limited, even in the tabular setting.

This paper develops {\em non-asymptotic} convergence guarantees for entropy-regularized NPG methods under softmax parameterization, focusing on discounted Markov decision processes (MDPs). Assuming access to exact policy evaluation, we demonstrate that the algorithm converges linearly, and even quadratically once it enters a local region around the optimal policy, when computing optimal value functions of the regularized MDP.
% The iteration complexity depends only logarithmically on the problem dimension.
Moreover, the algorithm is provably stable vis-\`a-vis inexactness of policy evaluation. Our convergence results accommodate a wide range of learning rates and shed light on the role of entropy regularization in enabling fast convergence.
%outperform the ones established for unregularized NPG methods \citep{agarwal2019optimality}, and
\end{abstract}