\begin{abstract}
The intensive communication and synchronization cost of exchanging gradients and parameters is a well-known
bottleneck in distributed deep learning training. Based on the observation that synchronous SGD (SSGD) attains
good convergence accuracy while asynchronous SGD (ASGD) delivers faster raw training speed, we propose
Several Steps Delay SGD (SSD-SGD) to combine their merits, tackling the communication bottleneck via
communication sparsification.
SSD-SGD exploits both global synchronous updates on the parameter servers and asynchronous local updates on
the workers in each periodic iteration. This periodic and flexible synchronization allows SSD-SGD to achieve
both good convergence accuracy and fast training speed. To the best of our knowledge, this strikes a new
balance between synchronization quality and communication sparsification and improves the trade-off between
accuracy and training speed. Specifically, the core components of SSD-SGD are a proper warm-up stage, a
steps-delay stage, and our novel global gradient for local update (GLU) algorithm. GLU is critical for the
local update operations, as it effectively compensates for the delayed local weights. Furthermore, we
implement SSD-SGD on the MXNet framework and comprehensively evaluate its performance on the CIFAR-10 and
ImageNet datasets. Experimental results show that SSD-SGD accelerates distributed training by up to 110\%
under different experimental configurations while achieving good convergence accuracy.

\end{abstract}