\begin{abstract}

We prove that optimistic follow-the-regularized-leader (OFTRL), together
with smooth value updates, finds an $O(T^{-1})$-approximate Nash
equilibrium in $T$ iterations for two-player zero-sum Markov games
with full information. This improves on the $\tilde{O}(T^{-5/6})$
convergence rate recently established in~\cite{zhang2022policy}. The
refined analysis hinges on two key ingredients. First, although the sum
of the two players' regrets is not necessarily non-negative in Markov
games, as it is in normal-form games, it is approximately non-negative.
This property allows us to bound the second-order path lengths of the
learning dynamics. Second, we prove a tighter algebraic inequality on
the weights deployed by OFTRL, which shaves off an extra $\log T$
factor. This improvement is crucial: it enables the inductive analysis
that leads to the final $O(T^{-1})$ rate.

\end{abstract}

16: