1: \begin{abstract}
2: 
3: 
4: 
5: 
6: Most commonly used distributed machine learning systems are either synchronous
7: or centralized asynchronous. Synchronous algorithms like AllReduce-SGD perform
8: poorly in a heterogeneous environment, while asynchronous algorithms using a
9: parameter server suffer from 1) a communication bottleneck at the parameter server
10: when there are many workers, and 2) significantly worse convergence when traffic
11: to the parameter server is congested. {\em Can we design an algorithm that is robust
12:   in a heterogeneous environment, while being communication efficient and
13:   maintaining the best-possible convergence rate?} In this paper, we propose an
14: asynchronous decentralized stochastic gradient descent algorithm (AD-PSGD)
15: that satisfies all of the above expectations. Our theoretical analysis shows that AD-PSGD
16: converges at the same optimal $O(1/\sqrt{K})$ rate as SGD and achieves linear speedup
17: w.r.t.\ the number of workers. Empirically, AD-PSGD outperforms the best of
18: decentralized parallel SGD (D-PSGD), asynchronous parallel SGD (A-PSGD), and
19: standard data parallel SGD (AllReduce-SGD), often by orders of magnitude in a
20: heterogeneous environment. When training ResNet-50 on ImageNet with up to 128
21: GPUs, AD-PSGD converges (w.r.t.\ epochs) similarly to
22: AllReduce-SGD, but each epoch can be up to 4--8$\times$ faster than its synchronous counterparts
23: in a network-sharing HPC environment. \iftoggle{icml}{}{To the best of our
24:   knowledge, AD-PSGD is the first asynchronous algorithm that achieves an
25:   epoch-wise convergence rate similar to that of AllReduce-SGD at a scale of over 100 GPUs.}
26: 
27: 
28: 
29:  \end{abstract}
30: