f876011de9272ea9.tex
1: \begin{abstract}
2: 
3: 
4: Offline Multi-Agent Reinforcement Learning (MARL) is an emerging field that aims to learn optimal multi-agent policies from pre-collected datasets. Compared to the single-agent case, the multi-agent setting involves a large joint state-action space and coupled behaviors among agents, both of which add complexity to offline policy optimization.
5: In this work, we revisit existing offline MARL methods and show that, in certain scenarios, they can be problematic, leading to uncoordinated behaviors and out-of-distribution (OOD) joint actions.
6: To address these issues, we propose a new offline MARL algorithm named In-Sample Sequential Policy Optimization (InSPO). InSPO sequentially updates each agent's policy in an in-sample manner, which not only avoids selecting OOD joint actions but also accounts for teammates' updated policies to enhance coordination. Additionally, by thoroughly exploring actions that have low probability under the behavior policy, InSPO mitigates premature convergence to sub-optimal solutions. Theoretically, we prove that InSPO guarantees monotonic policy improvement and converges to a quantal response equilibrium (QRE). Experimental results demonstrate the effectiveness of our method compared to current state-of-the-art offline MARL methods.
9: \begin{links}
10:     \link{Code}{https://github.com/kkkaiaiai/InSPO/}
11:     % \link{Datasets}{https://aaai.org/example/datasets}
12:     % \link{Extended version}{https://aaai.org/example/extended-version}
13: \end{links}
14: 
15: \end{abstract}
16: