\begin{abstract}
Communication overhead is a major bottleneck hampering the scalability of
distributed machine learning systems.
Recently, there has been a surge of interest in using gradient compression
to improve the communication efficiency of distributed
%synchronous
neural network training.
Using 1-bit quantization, signSGD with majority vote
achieves a $32\times$ reduction in communication cost. However, its convergence
analysis relies on unrealistic assumptions, and the method can diverge in practice.
In this paper, we propose a general distributed compressed SGD method with
Nesterov's momentum. We consider two-way compression, which compresses
the gradients sent both to and from the workers.
%the parameter server aggregates the workers' compressed gradients and transmits a compressed merged result back to the workers.
We provide a convergence analysis on nonconvex problems for
general gradient compressors.
By partitioning the gradient into blocks,
we introduce a blockwise compressor in which
each gradient block is compressed and transmitted in 1-bit format with a scaling
factor, leading to a nearly $32\times$ reduction in communication.
Experimental results show that the proposed method converges as fast as
full-precision distributed momentum SGD and achieves the same test accuracy. In
particular, on distributed ResNet training with 7 workers on ImageNet,
the proposed algorithm achieves the same test accuracy as
momentum SGD using full-precision gradients, but with $46\%$ less wall-clock time.
\end{abstract}