
\begin{abstract}

  We revisit the problem of learning in two-player zero-sum Markov games, focusing on developing an algorithm that is \emph{uncoupled}, \emph{convergent}, and \emph{rational}, with non-asymptotic convergence rates to Nash equilibrium. As a warm-up, we start with stateless matrix games under bandit feedback and show an $\order(t^{-\frac{1}{8}})$ last-iterate convergence rate. To the best of our knowledge, this is the first result that obtains a finite-time last-iterate convergence rate given access to only bandit feedback. We then extend our result to irreducible Markov games, providing a last-iterate convergence rate of $\order(t^{-\frac{1}{9+\varepsilon}})$ for any $\varepsilon>0$. Finally, we study Markov games without any assumptions on the dynamics and show a rate of $\order(t^{-\frac{1}{10}})$ for \emph{path convergence}, a new notion of convergence that we define. Our algorithm removes the coordination and prior-knowledge requirements of \citet{wei2021last}, who pursued the same goals for irreducible Markov games. It is related to the algorithms of \citet{chen2021sample, cen2021fast} and likewise builds on entropy regularization. However, we remove their requirement that players communicate entropy values, making our algorithm entirely uncoupled.

\end{abstract}
