
\begin{abstract}

Finding an optimal signal timing strategy is difficult in large-scale traffic signal control (TSC). Multi-agent reinforcement learning (MARL) is a promising approach to this problem.

However, existing MARL methods still leave room for improvement in scaling to large problems and in modeling, for each individual agent, the behaviors of the other agents.

In this paper, we propose a new MARL algorithm, called \emph{Cooperative double Q-learning} (Co-DQL), which has several prominent features. It uses a highly scalable independent double Q-learning method based on double estimators and the upper confidence bound (UCB) policy, which can eliminate the over-estimation problem of traditional independent Q-learning while ensuring exploration. It uses a mean field approximation to model the interaction among agents, thereby helping agents learn a better cooperative strategy.

To improve the stability and robustness of the learning process, we introduce a new reward allocation mechanism and a local state sharing method.

In addition, we analyze the convergence properties of the proposed algorithm.

We apply Co-DQL to TSC and test it on various traffic flow scenarios in TSC simulators. The results show that Co-DQL outperforms state-of-the-art decentralized MARL algorithms on multiple traffic metrics.

\end{abstract}
