abstract:2aaff06db26cc6a4.tex

1: \begin{abstract}

2: We consider infinite-horizon discounted Markov decision problems with finite state and action spaces.

3: We show that with direct parametrization in the policy space, the weighted value function, although non-convex in general, is both quasi-convex and quasi-concave.

4: While quasi-convexity helps explain the convergence of policy gradient methods to global optima, quasi-concavity suggests that they can converge with arbitrarily large step sizes that are not dictated by the Lipschitz constant characterizing the smoothness of the value function.

5: In particular, we show that, with geometrically increasing step sizes, a general class of policy mirror descent methods, including the natural policy gradient method and a projected Q-descent method, enjoys a linear rate of convergence without relying on entropy or other strongly convex regularization.

6: In addition, we develop a theory of weak gradient-mapping dominance and use it to prove a sharper sublinear convergence rate for the projected policy gradient method.

7: Finally, we analyze the convergence rate of an inexact policy mirror descent method and estimate its sample complexity under a simple generative model.

8: %Finally, we also estimate the sample complexity of stochastic policy mirror descent methods under a simple simulation model.

9: \end{abstract}

10: