
\begin{abstract}


Natural policy gradient (NPG) methods are among the most widely used policy optimization algorithms in contemporary reinforcement learning.
This class of methods is often applied in conjunction with entropy regularization (an algorithmic scheme that encourages exploration) and is closely related to soft policy iteration and trust region policy optimization. Despite their empirical success, the theoretical underpinnings of NPG methods remain limited, even in the tabular setting.


This paper develops {\em non-asymptotic} convergence guarantees for entropy-regularized NPG methods under softmax parameterization, focusing on discounted Markov decision processes (MDPs). Assuming access to exact policy evaluation, we demonstrate that the algorithm converges linearly, and even quadratically once it enters a local region around the optimal policy, when computing optimal value functions of the regularized MDP.
% The iteration complexity depends only logarithmically on the problem dimension.
Moreover, the algorithm is provably stable vis-\`a-vis inexactness of policy evaluation. Our convergence results accommodate a wide range of learning rates and shed light on the role of entropy regularization in enabling fast convergence.

%outperform the ones established for unregularized NPG methods \citep{agarwal2019optimality}, and

\end{abstract}
