\begin{abstract}
Communication overhead is a major bottleneck hampering the scalability of
distributed machine learning systems.
Recently, there has been a surge of interest in
using gradient compression
to improve the communication efficiency of distributed
%synchronous
neural network
training.
Using 1-bit quantization,
signSGD with majority vote
achieves a $32\times$ reduction in communication cost. However, its convergence
analysis relies on unrealistic assumptions, and the method can diverge in practice.
In this paper, we propose a general distributed compressed SGD method with
Nesterov's momentum. We consider two-way
compression, which
compresses
the gradients
both
to and from workers.
%the parameter server aggregates the workers' compressed gradients and transmits a compressed merged result back to the workers.
We provide a convergence analysis on nonconvex problems for
general gradient compressors.
By partitioning
the gradient
into blocks,
we introduce a blockwise compressor in which
each gradient block is compressed and transmitted in 1-bit format with a scaling
factor, leading to a nearly $32\times$ reduction in communication.
Experimental results show that the proposed method converges as fast as
full-precision distributed momentum SGD and achieves the same test accuracy.  In
particular,
on distributed ResNet training with 7 workers
on ImageNet,
the proposed algorithm achieves the same test accuracy as
momentum SGD using full-precision gradients, but with $46\%$ less wall-clock time.
\end{abstract}
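
As a rough illustration of why blockwise 1-bit compression gives a reduction of
nearly, rather than exactly, $32\times$, consider the following sketch. It
assumes (these choices are illustrative and are not stated in the abstract) that
a gradient block $v_b$ with $d_b$ entries is stored in 32-bit floating point,
and that it is transmitted as its sign vector together with a single 32-bit
scaling factor, taken here to be the mean absolute value of the block:
\[
  \mathcal{C}(v_b) \;=\; \frac{\|v_b\|_1}{d_b}\,\mathrm{sign}(v_b),
  \qquad
  \mbox{compression ratio} \;=\; \frac{32\,d_b}{d_b + 32}.
\]
The numerator counts the bits of the uncompressed block, and the denominator
counts one bit per entry plus the 32 bits of the scaling factor. Under these
assumptions, a block with $d_b = 10^4$ entries gives a ratio of about $31.9$,
and the ratio approaches $32$ as $d_b$ grows.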