\begin{abstract}
We study the learning of Nash equilibria in a general-sum stochastic game with an unknown transition probability density
function. Agents take actions at the current environment state, and their joint action influences both the transition of the environment state and their
immediate rewards. Each agent observes only the environment state
and its own immediate reward, and has no knowledge of the actions or immediate rewards of the other agents. We introduce the concepts of weighted asymptotic Nash equilibrium with probability $1$ and in probability. For the case with exact pseudo gradients, we design a two-loop algorithm based on the equivalence between Nash equilibrium problems and variational inequality problems.
In the outer loop, we sequentially update a constructed strongly monotone variational inequality by updating a proximal parameter; in the inner loop, we employ a single-call extra-gradient algorithm to solve the constructed variational inequality. We show that if the associated Minty variational inequality has a solution, then the designed algorithm converges to the $k^{\frac{1}{2}}$-weighted asymptotic Nash equilibrium. Further, for the case with unknown pseudo gradients, we propose a decentralized algorithm in which the pseudo gradient is estimated by the G(PO)MDP gradient estimator using Monte Carlo simulations. We establish convergence to the $k^{\frac{1}{4}}$-weighted asymptotic Nash equilibrium in probability.
\end{abstract}