\begin{abstract} % Abstract of not more than 200
%===============================================================================
This paper studies a distributed temporal-difference (TD)-learning algorithm for a class of multi-agent Markov decision processes (MDPs). Single-agent TD-learning is a reinforcement learning (RL)
algorithm that evaluates the accumulated rewards corresponding to a given policy. In the multi-agent setting, multiple RL agents act concurrently, each following its own local behavior policy, and learn the accumulated global reward, which is the sum of the local rewards. The goal of each agent is to evaluate the accumulated global reward while receiving only its own local rewards. The algorithm shares learning parameters through random network communications, which have a randomly changing undirected graph structure. The problem is converted
into a distributed optimization problem and the corresponding saddle-point problem of its Lagrangian function. The proposed TD-learning algorithm is a stochastic primal-dual algorithm that solves this saddle-point problem. We prove finite-time convergence of the algorithm and establish its convergence rates and sample complexity.
%===============================================================================
\end{abstract}