8960312a2876059e.tex
1: \begin{abstract}
2: We study the problem of learning a Nash equilibrium (NE) in an imperfect information game (IIG) through self-play. Precisely, we focus on two-player, zero-sum, episodic, tabular IIG under the \textit{perfect-recall} assumption where the only feedback is realizations of the game (bandit feedback). In particular, the \textit{dynamics of the IIG is not known}---we can only access it by sampling or interacting with a game simulator. For this learning setting, we provide the Implicit Exploration Online Mirror Descent (\ouralgo) algorithm. It is a model-free algorithm with a high-probability bound on the convergence rate to the NE of order $1/\sqrt{T}$ where~$T$ is the number of played games. Moreover, \ouralgo is computationally efficient as it needs to perform the \textit{updates only along the sampled trajectory}.
3: \end{abstract}
4: