190a27c3c4a0c679.tex
1: \begin{abstract}
2: The heavy communication and synchronization cost of gradients and parameters is a well-known bottleneck in
3: distributed deep learning training. Based on the observation that synchronous SGD (SSGD) obtains good convergence
4: accuracy while asynchronous SGD (ASGD) delivers faster raw training speed, we propose Several Steps Delay SGD (SSD-SGD) to combine their merits, aiming to tackle the communication bottleneck via communication sparsification.
5: SSD-SGD exploits both global synchronous updates in the parameter servers and asynchronous local updates in the workers in each periodic iteration. This periodic and flexible synchronization lets SSD-SGD achieve both good
6: convergence accuracy and fast training speed. To the best of our knowledge, we strike a new balance between
7: synchronization quality and communication sparsification, and improve the trade-off between accuracy and training
8: speed. Specifically, the core components of SSD-SGD include a proper warm-up stage, a steps-delay stage, and our novel
9: algorithm, global gradient for local update (GLU). GLU is critical for local update operations to effectively
10: compensate for the delayed local weights. Furthermore, we implement SSD-SGD on the MXNet framework and
11: comprehensively evaluate its performance on the CIFAR-10 and ImageNet datasets. Experimental results show that
12: SSD-SGD can accelerate distributed training under different experimental configurations, by up to 110\%, while
13: achieving good convergence accuracy.
14: 	
15: \end{abstract}
16: