
\begin{abstract}%

    We study infinite-horizon discounted two-player zero-sum Markov games and develop a decentralized algorithm that provably converges to the set of Nash equilibria under self-play. Our algorithm runs Optimistic Gradient Descent Ascent at each state to learn the policies, with a critic that slowly learns the value of each state. To the best of our knowledge, this is the first algorithm in this setting that is simultaneously {\it rational} (converging to a best response to the opponent when the opponent uses a stationary policy), {\it convergent} (converging to the set of Nash equilibria under self-play), {\it agnostic} (requiring no knowledge of the actions played by the opponent), and {\it symmetric} (the players take symmetric roles in the algorithm), while also enjoying a {\it finite-time last-iterate convergence} guarantee, all of which are desirable properties of decentralized algorithms.

\end{abstract}