1: \begin{abstract}
2: We consider infinite-horizon discounted Markov decision problems with finite state and action spaces.
3: We show that with direct parametrization in the policy space, the weighted value function, although non-convex in general, is both quasi-convex and quasi-concave.
4: While quasi-convexity helps explain the convergence of policy gradient methods to global optima, quasi-concavity hints at their convergence guarantees using arbitrarily large step sizes that are not dictated by the Lipschitz constant charactering smoothness of the value function.
5: In particular, we show that when using geometrically increasing step sizes, a general class of policy mirror descent methods, including the natural policy gradient method and a projected Q-descent method, all enjoy a linear rate of convergence without relying on entropy or other strongly convex regularization.
6: In addition, we develop a theory of weak gradient-mapping dominance and use it to prove sharper sublinear convergence rate of the projected policy gradient method.
7: Finally, we also analyze the convergence rate of an inexact policy mirror descent method and estimate its sample complexity under a simple generative model.
8: %Finally, we also estimate the sample complexity of stochastic policy mirror descent methods under a simple simulation model.
9: \end{abstract}
10: