
\begin{abstract}
We present new policy mirror descent (PMD) methods
for solving
reinforcement learning (RL) problems
with either strongly convex or general convex regularizers.
By exploring the structural properties
of these overall highly nonconvex problems, we show that the PMD methods
exhibit a fast linear rate of convergence to global optimality.
We develop stochastic
counterparts of these methods, and establish an ${\cal O}(1/\epsilon)$
(resp., ${\cal O}(1/\epsilon^2)$) sampling complexity for solving these RL problems with
strongly (resp., general) convex regularizers using different sampling schemes, where $\epsilon$
denotes the target accuracy. We further
show that the complexity for computing the gradients of these regularizers, if necessary,
can be bounded by ${\cal O}\{(\log_\gamma \epsilon) [(1-\gamma)L/\mu]^{1/2}\log (1/\epsilon)\}$
(resp., ${\cal O} \{(\log_\gamma \epsilon ) (L/\epsilon)^{1/2}\}$)
for problems with strongly (resp., general) convex regularizers. Here $\gamma$ denotes
the discounting factor.
To the best of our knowledge, these complexity bounds,
along with our algorithmic developments,
appear to be new in both the optimization and RL literature.
The introduction of these convex regularizers also
greatly enhances the flexibility and thus expands the applicability of RL models.
\end{abstract}
