
\begin{abstract}
We study scalable multi-agent reinforcement learning (MARL) with general utilities, defined as nonlinear functions of the team's long-term state-action occupancy measure.
The objective is to find a localized policy that maximizes the average of the team's local utility functions without requiring full observability for each agent in the team.
% that depends only on states of each agent's $\kappa$-hop neighbors
% such that it does not require the full observability for each agent in team and its complexity does not depend on the entire state-action space size that scales exponentially in the number of agents.
By exploiting the spatial correlation decay property of the network structure, we propose a scalable distributed policy gradient algorithm with shadow reward and localized policy that consists of three steps: (1) shadow reward estimation, (2) truncated shadow Q-function estimation, and (3) truncated policy gradient estimation and policy update.
Our algorithm converges, with high probability, to $\epsilon$-stationarity with $\widetilde{\mc{O}}(\epsilon^{-2})$ samples, up to an approximation error that decreases exponentially in the communication radius.
This is the first result in the literature on multi-agent RL with general utilities that does not require full observability.
% This is the first result in the literature on multi-agent RL with general utilities.
% with the error term $\mc{O}\left(n\phi_2^{2\kappa} + \sum_{i\in \mcN}\frac{|\mcNk_i|^2}{n^2} \phi_1^{2\kappa}\right)$
% where $\phi_1,\phi_2 \in (0,1)$, $n$ is the number of agents, $\mcN$ is the set of agents and $\mcNk_i$ is the set of agents in the $\kappa$-hop neighborhood of agent $i$.
% Our results can also generalize to the global optimality convergence if the utility functions are concave and the policy parameterization satisfies some mild conditions.
\end{abstract}
