
\begin{abstract}

Offline Multi-Agent Reinforcement Learning (MARL) is an emerging field that aims to learn optimal multi-agent policies from pre-collected datasets. Compared to the single-agent case, the multi-agent setting involves a large joint state-action space and coupled behaviors among agents, which add complexity to offline policy optimization.

In this work, we revisit existing offline MARL methods and show that in certain scenarios they can be problematic, leading to uncoordinated behaviors and out-of-distribution (OOD) joint actions.

To address these issues, we propose a new offline MARL algorithm, In-Sample Sequential Policy Optimization (InSPO). InSPO sequentially updates each agent's policy in an in-sample manner, which not only avoids selecting OOD joint actions but also accounts for teammates' updated policies to enhance coordination. Additionally, by thoroughly exploring actions that have low probability under the behavior policy, InSPO addresses the issue of premature convergence to sub-optimal solutions. Theoretically, we prove that InSPO guarantees monotonic policy improvement and converges to a quantal response equilibrium (QRE). Experimental results demonstrate the effectiveness of our method compared to current state-of-the-art offline MARL methods.


\begin{links}

    \link{Code}{https://github.com/kkkaiaiai/InSPO/}


\end{links}


\end{abstract}
