8a24d2d0990051bd.tex
1: \begin{abstract}
2: Most of today's distributed machine learning systems
3: assume {\em reliable networks}: whenever
4: two machines exchange information (e.g., gradients
5: or models), the network should guarantee the delivery
6: of the message. At the same time, recent 
7: work has demonstrated 
8: the impressive tolerance of machine learning algorithms to errors or noise arising from relaxed communication or synchronization. 
9: In this paper, we connect these two trends, and consider the  
10: following question: {\em Can we design machine
11: learning systems that are tolerant to network unreliability during training?}
12: With this motivation, we focus on a theoretical
13: problem of independent interest---given a standard distributed parameter server 
14: architecture, if every communication between
15: the worker and the server has a non-zero probability $p$
16: of being dropped, does there exist 
17: an algorithm that still converges,
18: and at what speed?
19: The technical contribution of this paper is a novel theoretical analysis proving that 
20:  distributed learning over an unreliable network can achieve a convergence rate comparable to that of centralized or distributed learning over reliable networks. 
21:  Further, we prove that the influence of the packet drop rate diminishes as the number of \textcolor{black}{parameter servers} grows.
22: We map this theoretical result onto a real-world scenario, training deep neural networks over
23: an unreliable network layer, and conduct network simulations
24: to validate the system improvement gained by allowing the network to be unreliable.
25: \end{abstract}
26: