831ce4b155478a0b.tex
1: \begin{abstract}
2: We consider model-based multi-agent reinforcement learning, where the environment transition model is unknown and can only be learned via expensive interactions with the environment. We propose \textsc{H-MARL} (Hallucinated Multi-Agent Reinforcement Learning), a novel sample-efficient algorithm that balances \emph{exploration}, i.e., learning about the environment, and \emph{exploitation}, i.e., achieving good equilibrium performance in the underlying general-sum Markov game. \textsc{H-MARL} builds high-probability confidence intervals around the unknown transition model and sequentially updates them based on newly observed data. Using these intervals, it constructs an \emph{optimistic} \emph{hallucinated game} for the agents and computes equilibrium policies for this game at each round.
3: We consider general statistical models (e.g., Gaussian processes and deep ensembles) and policy classes (e.g., deep neural networks), and theoretically analyze our approach by bounding the agents' \emph{dynamic regret}. Moreover, we provide a convergence rate to the equilibria of the underlying Markov game. We demonstrate our approach experimentally on an autonomous driving simulation benchmark. \textsc{H-MARL} learns successful equilibrium policies after a few interactions with the environment and can significantly improve performance compared to non-exploratory methods.\looseness=-1
4: \end{abstract}
5: