\begin{abstract}
%We propose the homotopic policy mirror descent (HPMD) method for solving discounted, infinite-horizon MDPs with finite state and action spaces, and study its convergence properties.
%We report several findings that appear to be new in the literature on policy gradient methods:
%(1) HPMD exhibits global linear convergence of the value optimality gap, and local superlinear convergence of both the policy and the optimality gap with order $\gamma^{-2}$, where $\gamma$ denotes the discount factor. The superlinear convergence takes effect after no more than $\cO(\log(1/\Delta^*))$ iterations, where $\Delta^*$ is defined via a gap quantity associated with the optimal state-action value function;
%(2) HPMD also exhibits last-iterate convergence of the policy, with the limiting policy corresponding exactly to the optimal policy with the maximal entropy for every state;
%%No regularization is added to the policy optimization objective and hence the second observation arises solely as an algorithmic property of the HPMD method;
%(3) Both the local acceleration and the last-iterate policy convergence of HPMD hold for a much broader class of decomposable Bregman divergences, including the $p$-th power of the $\ell_p$-norm and the negative Tsallis entropy. As a byproduct of the analysis, we also discover the finite-time exact convergence of HPMD with these divergences, and show that HPMD continues converging to the limiting policy even if the current policy is already optimal;
%(4) For the stochastic HPMD method, we further demonstrate that, for a small optimality gap $\epsilon$, a better than $\tilde{\cO}(\abs{\cS} \abs{\cA} / \epsilon^2)$ sample complexity holds with high probability, assuming a generative model for policy evaluation.
%
%
%\keywords{policy gradient method \and local acceleration \and policy convergence \and sample complexity}
%% \PACS{PACS code1 \and PACS code2 \and more}
%\subclass{90C40 \and 90C15 \and 90C26 \and 68Q25}
%\end{abstract}