\begin{abstract}
%We propose the homotopic policy mirror descent (HPMD) method for solving discounted, infinite-horizon MDPs with finite state and action spaces, and study its convergence properties.
%We report several findings that seem to be new in the literature on policy gradient methods:
%(1) HPMD exhibits global linear convergence of the value optimality gap, and local superlinear convergence of both the policy and the optimality gap with order $\gamma^{-2}$, where $\gamma$ denotes the discount factor. The superlinear convergence takes effect after no more than $\cO(\log(1/\Delta^*))$ iterations, where $\Delta^*$ is defined via a gap quantity associated with the optimal state-action value function;
%(2) HPMD also exhibits last-iterate convergence of the policy, with the limiting policy corresponding exactly to the optimal policy with the maximal entropy for every state.
%%No regularization is added to the policy optimization objective and hence the second observation arises solely as an algorithmic property of the HPMD method;
%(3) Both the local acceleration and the last-iterate policy convergence of HPMD hold for a much broader class of decomposable Bregman divergences, including the $p$-th power of the $\ell_p$-norm and the negative Tsallis entropy. As a byproduct of the analysis, we also discover the finite-time exact convergence of HPMD with these divergences, and show that HPMD continues converging to the limiting policy even if the current policy is already optimal;
%(4) For the stochastic HPMD method, we further demonstrate that, for a small optimality gap $\epsilon$, a sample complexity better than $\tilde{\cO}(\abs{\cS} \abs{\cA} / \epsilon^2)$ holds with high probability, assuming a generative model for policy evaluation.
%
%
%\keywords{policy gradient method \and local acceleration \and policy convergence \and sample complexity}
%% \PACS{PACS code1 \and PACS code2 \and more}
%\subclass{90C40 \and 90C15 \and 90C26 \and 68Q25}
%\end{abstract}
