\begin{abstract}
We study scalable multi-agent reinforcement learning (MARL) with general utilities, defined as nonlinear functions of the team's long-term state-action occupancy measure.
The objective is to find a localized policy that maximizes the average of the team's local utility functions without requiring full observability for each agent in the team.
% that depends only on states of each agent's $\kappa$-hop neighbors 
% such that it does not require the full observability for each agent in team and its complexity does not depend on the entire state-action space size that scales exponentially in the number of agents. 
By exploiting the spatial correlation decay property of the network structure, we propose a scalable distributed policy gradient algorithm with shadow reward and localized policy. The algorithm consists of three steps: (1) shadow reward estimation, (2) truncated shadow Q-function estimation, and (3) truncated policy gradient estimation and policy update. It converges, with high probability, to $\epsilon$-stationarity with $\widetilde{\mc{O}}(\epsilon^{-2})$ samples, up to an approximation error that decreases exponentially in the communication radius.
This is the first result in the literature on multi-agent RL with general utilities that does not require full observability.
% This is the first result in the literature on multi-agent RL with general utilities.
% with the error term $\mc{O}\left(n\phi_2^{2\kappa} + \sum_{i\in \mcN}\frac{|\mcNk_i|^2}{n^2} \phi_1^{2\kappa}\right)$
% where $\phi_1,\phi_2 \in (0,1)$, $n$ is the number of agents, $\mcN$ is the set of agents and $\mcNk_i$ is the set of agents in the $\kappa$-hop neighborhood of agent $i$.
% Our results can also generalize to the global optimality convergence if the utility functions are concave and the policy parameterization satisfies some mild conditions.
\end{abstract}