1: \begin{abstract}
2: Most of today's distributed machine learning systems
3: assume {\em reliable networks}: whenever
4: two machines exchange information (e.g., gradients
5: or models), the network should guarantee the delivery
6: of the message. At the same time, recent
7: work exhibits
8: the impressive tolerance of machine learning algorithms to errors or noise arising from relaxed communication or synchronization.
9: In this paper, we connect these two trends, and consider the
10: following question: {\em Can we design machine
11: learning systems that are tolerant to network unreliability during training?}
12: With this motivation, we focus on a theoretical
13: problem of independent interest---given a standard distributed parameter server
14: architecture, if every communication between
15: the worker and the server has a non-zero probability $p$
16: of being dropped, does there exist
17: an algorithm that still converges,
18: and at what speed?
19: The technical contribution of this paper is a novel theoretical analysis proving that
20: distributed learning over unreliable network can achieve comparable convergence rate to centralized or distributed learning over reliable networks.
21: Further, we prove that the influence of the packet drop rate diminishes with the growth of the number of \textcolor{black}{parameter servers}.
22: We map this theoretical result onto a real-world scenario, training deep neural networks over
23: an unreliable network layer, and conduct network simulation
24: to validate the system improvement by allowing the networks to be unreliable.
25: \end{abstract}
26: