ec62b255038e8608.tex
1: \begin{abstract}
2: Communication overhead is the key challenge for distributed training. Gradient compression is a widely used approach to reduce communication traffic. When combining with parallel communication mechanism method like pipeline, gradient compression technique can greatly alleviate the impact of communication overhead. However, there exists two problems of gradient compression technique to be solved. Firstly, gradient compression brings in extra computation cost, which will delay the next training iteration. Secondly, gradient compression usually leads to the decrease of convergence accuracy. In this paper, we combine parallel mechanism with gradient quantization and delayed full-gradient compensation, and propose a new distributed optimization method named CD-SGD, which can hide the overhead of gradient compression, overlap part of the communication and obtain high convergence accuracy. The local update operation in CD-SGD allows the next iteration to be launched quickly without waiting for the completion of gradient compression and current communication process. Besides, the accuracy loss caused by gradient compression is solved by k-step correction method introduced in CD-SGD. We prove that CD-SGD has convergence guarantee and it achieves at least 
3: \begin{math}
4:  O(\frac {1}{\sqrt{K}}+\frac {1}{K})
5: \end{math} convergence rate. We conduct extensive experiments on MXNet to verify the convergence properties and scaling performance of CD-SGD. Experimental results on a 16-GPU cluster show that convergence accuracy of CD-SGD is close to or even slightly better than that of S-SGD, and its end-to-end time is 30$\%$ less than 2 bit gradient compression under 56Gbps bandwidth environment.
6: \end{abstract}
7: