\begin{abstract}
We present new policy mirror descent (PMD) methods
for solving
reinforcement learning (RL) problems
with either strongly convex or general convex regularizers.
By exploring the structural properties
of these highly nonconvex problems, we show that the PMD methods
exhibit a fast linear rate of convergence to global optimality.
We develop stochastic
counterparts of these methods, and establish an ${\cal O}(1/\epsilon)$
(resp., ${\cal O}(1/\epsilon^2)$) sampling complexity for solving these RL problems with
strongly (resp., general) convex regularizers using different sampling schemes, where $\epsilon$
denotes the target accuracy. We further
show that the complexity for computing the gradients of these regularizers, if necessary,
can be bounded by ${\cal O}\{(\log_\gamma \epsilon) [(1-\gamma)L/\mu]^{1/2}\log (1/\epsilon)\}$
(resp., ${\cal O} \{(\log_\gamma \epsilon ) (L/\epsilon)^{1/2}\}$)
for problems with strongly (resp., general) convex regularizers. Here $\gamma$ denotes
the discount factor.
To the best of our knowledge, these complexity bounds,
along with our algorithmic developments,
appear to be new in both the optimization and RL literature.
The introduction of these convex regularizers also
greatly enhances the flexibility and thus expands the applicability of RL models.
\end{abstract}
