\begin{abstract}

Most of today's distributed machine learning systems
assume {\em reliable networks}: whenever
two machines exchange information (e.g., gradients
or models), the network is expected to guarantee delivery
of the message. At the same time, recent
work has shown that machine learning algorithms are remarkably tolerant
to errors or noise arising from relaxed communication or synchronization.
In this paper, we connect these two trends and consider the
following question: {\em Can we design machine
learning systems that are tolerant to network unreliability during training?}
With this motivation, we focus on a theoretical
problem of independent interest: given a standard distributed parameter server
architecture in which every communication between
a worker and a server has a non-zero probability $p$
of being dropped, does there exist
an algorithm that still converges,
and at what rate?
The technical contribution of this paper is a novel theoretical analysis proving that
distributed learning over an unreliable network can achieve a convergence rate comparable to that of centralized or distributed learning over reliable networks.
Further, we prove that the influence of the packet drop rate diminishes as the number of \textcolor{black}{parameter servers} grows.
We map this theoretical result onto a real-world scenario, training deep neural networks over
an unreliable network layer, and conduct network simulations
to validate the system improvement obtained by allowing the network to be unreliable.
\end{abstract}
