\begin{abstract}
Current approaches to multi-agent cooperation rely heavily on centralized mechanisms or explicit communication protocols to ensure convergence.
This paper studies the problem of distributed multi-agent learning without resorting to explicit coordination schemes.
The proposed algorithm (DM$^2$) leverages distribution matching to facilitate coordination among independent agents.
Each individual agent matches a target distribution of concurrently sampled trajectories from a joint expert policy.
The theoretical analysis shows that, under certain conditions, if each agent optimizes its individual distribution matching objective, the agents increase a lower bound on the objective of matching the joint expert policy, allowing convergence to the joint expert policy.
Further, if the distribution matching objective is aligned with a joint task, a combination of environment reward and distribution matching reward leads to the same equilibrium.
Experimental validation on the StarCraft domain shows that combining the reward for distribution matching with the environment reward allows agents to outperform a fully distributed baseline.
Additional experiments probe the conditions under which expert demonstrations need to be sampled in order to outperform the fully distributed baseline.
\end{abstract}