1: \begin{abstract}
2: We obtain global, non-asymptotic convergence guarantees for independent learning algorithms in competitive reinforcement learning settings
3: with two agents (i.e.,~zero-sum stochastic games). We consider an episodic setting where in
4: each episode, each player independently selects a policy and observes only
5: \emph{their own} actions and rewards, along with the state. We show that
6: if both players run policy gradient methods in tandem, their policies
7: will converge to a min-max equilibrium of the game, as long as their
8: learning rates follow a two-timescale rule (which is necessary). %
9: To
10: the best of our knowledge, this constitutes the first finite-sample
11: convergence result for {independent policy gradient methods} in competitive RL; prior work has largely focused on centralized, coordinated
12: procedures for equilibrium computation.
17: \end{abstract}
18: