a3b6d4c74adfadf4.tex
1: \begin{abstract}
2: In this paper, we study a fully-decentralized multi-agent policy evaluation problem, which is an important sub-problem in cooperative multi-agent reinforcement learning, in the presence of up to $f$ faulty agents. In particular, we focus on the so-called Byzantine faulty model with model poisoning setting. In general, policy evaluation is to evaluate the value function of any given policy. In cooperative multi-agent system, the system-wide rewards are usually modeled as the uniform average of rewards from all agents.  
3: We investigate the multi-agent policy evaluation problem in the presence of Byzantine agents, particularly in the setting of heterogeneous local rewards. Ideally, the goal of the agents is to evaluate the accumulated system-wide rewards, which are uniform average of rewards of the normal agents for a given policy. It means that all agents agree upon common values (the consensus part) and furthermore, the consensus values are the value functions (the convergence part).
4: However, we prove that this goal is not achievable.
5: Instead, we consider a relaxed version of the problem, where the goal of the agents is to evaluate accumulated system-wide reward, which is an appropriately weighted average reward of the normal agents.
6: We further prove that there is no correct algorithm that can guarantee that the total number of positive weights exceeds $|\mathcal{N}|-f $, where $|\mathcal{N}|$ is the number of normal agents. 
7: Towards the end, we propose a Byzantine-tolerant decentralized temporal difference algorithm that can guarantee asymptotic consensus under scalar function approximation. We then empirically test the effective of the proposed algorithm.
8: \end{abstract}
9: