abstract:853844c3a315e9b4.tex

1: \begin{abstract}

2:

3:

4:

5:

6: Most commonly used distributed machine learning systems are either synchronous

7: or centralized asynchronous. Synchronous algorithms like AllReduce-SGD perform

8: poorly in a heterogeneous environment, while asynchronous algorithms using a

9: parameter server suffer from 1) communication bottleneck at parameter servers

10: when workers are many, and 2) significantly worse convergence when the traffic

11: to parameter server is congested. {\em Can we design an algorithm that is robust

12:   in a heterogeneous environment, while being communication efficient and

13:   maintaining the best-possible convergence rate?} In this paper, we propose an

14: asynchronous decentralized stochastic gradient decent algorithm (AD-PSGD)

15: satisfying all above expectations. Our theoretical analysis shows AD-PSGD

16: converges at the optimal $O(1/\sqrt{K})$ rate as SGD and has linear speedup

17: w.r.t. number of workers. Empirically, AD-PSGD outperforms the best of

18: decentralized parallel SGD (D-PSGD), asynchronous parallel SGD (A-PSGD), and

19: standard data parallel SGD (AllReduce-SGD), often by orders of magnitude in a

20: heterogeneous environment. When training ResNet-50 on ImageNet with up to 128

21: GPUs, AD-PSGD converges (w.r.t epochs) similarly to the

22: AllReduce-SGD, but each epoch can be up to 4-8$\times$ faster than its synchronous counterparts

23: in a network-sharing HPC environment. \iftoggle{icml}{}{To the best of our

24:   knowledge, AD-PSGD is the first asynchronous algorithm that achieves a similar

25:   epoch-wise convergence rate as AllReduce-SGD, at an over 100-GPU scale.}

26:

27:

28:

29:  \end{abstract}

30: