705ab6ec482622cf.tex
1: \begin{abstract}
2: We present new policy mirror descent (PMD) methods
3: for solving
4: reinforcement learning (RL) problems
5: with either strongly convex or general convex regularizers.
6: By exploring the structural properties
7: of these overall highly nonconvex problems, we show that the PMD methods
8: exhibit a fast linear rate of convergence to global optimality.
9:  We develop stochastic
10:  counterparts of these methods, and establish an ${\cal O}(1/\epsilon)$
11:  (resp., ${\cal O}(1/\epsilon^2)$) sampling complexity for solving these RL problems with
12:  strongly (resp., general) convex regularizers using different sampling schemes, where $\epsilon$
13:  denotes the target accuracy. We further
14:  show that the complexity for computing the gradients of these regularizers, if necessary,
15:  can be bounded by ${\cal O}\{(\log_\gamma \epsilon) [(1-\gamma)L/\mu]^{1/2}\log (1/\epsilon)\}$
16:  (resp., ${\cal O} \{(\log_\gamma \epsilon ) (L/\epsilon)^{1/2}\}$)
17:  for problems with strongly (resp., general) convex regularizers. Here $\gamma$ denotes
18:  the discount factor.
19: To the best of our knowledge, these complexity bounds,
20: along with our algorithmic developments,
21: appear to be new in both the optimization and RL literature.
22: The introduction of these convex regularizers also 
23: greatly enhances the flexibility and thus expands the applicability of RL models.
24: \end{abstract}
25: