\begin{abstract}
We prove that optimistic follow-the-regularized-leader (OFTRL), together
with smooth value updates, finds an $O(T^{-1})$-approximate Nash
equilibrium in $T$ iterations for two-player zero-sum Markov games
with full information. This improves on the $\tilde{O}(T^{-5/6})$ convergence
rate recently shown in~\cite{zhang2022policy}. The refined
analysis hinges on two essential ingredients. First, the sum of the
regrets of the two players, though not necessarily non-negative as
in normal-form games, is approximately non-negative in Markov games.
This property allows us to bound the second-order path lengths of
the learning dynamics. Second, we prove a tighter algebraic inequality
regarding the weights deployed by OFTRL that shaves off an extra $\log T$
factor. This crucial improvement enables the inductive analysis that
leads to the final $O(T^{-1})$ rate.
\end{abstract}