6c48129f5d550a4d.tex
1: \begin{abstract} 
2: Multi-agent interactions are increasingly important in reinforcement learning, and the theoretical foundations of policy gradient methods have attracted surging research interest. We investigate the global convergence of natural policy gradient (NPG) algorithms in multi-agent learning.
3: We first show that vanilla NPG may not have {\it parameter convergence}, i.e., convergence of the vector that parameterizes the policy, even when the costs are regularized (regularization {enabled} strong convergence guarantees in the {\it policy space} in the literature). This non-convergence of parameters leads to stability issues in learning, which become especially relevant in the function approximation setting, where we can only operate on low-dimensional parameters instead of the high-dimensional policy.
4: We then propose {variants} of the NPG algorithm for several standard multi-agent learning scenarios: two-player zero-sum matrix and Markov games, and multi-player monotone games,
5: %We analyze both tabular and the function approximation settings for these scenarios. 
6: %To the best of our knowledge, this is the first class of PG algorithms that have 
7: with global last-iterate  parameter convergence guarantees. We also generalize the results to certain function approximation settings.  
8: Note that in our algorithms, the agents take {\it symmetric} roles. 
9: Our results may also be of independent interest for solving nonconvex-nonconcave minimax optimization problems with certain structures. We also provide simulations that corroborate our theoretical findings.
10: \end{abstract}