\begin{abstract}
Communication overhead is a major bottleneck hampering the scalability of
distributed machine learning systems.
Recently, there has been a surge of interest in
using gradient compression
to improve the communication efficiency of distributed
%synchronous
neural network
training.
Using 1-bit quantization,
signSGD with majority vote
achieves a $32\times$ reduction in communication cost. However, its convergence
analysis relies on unrealistic assumptions, and the method can diverge in practice.
In this paper, we propose a general distributed compressed SGD method with
Nesterov's momentum. We consider two-way
compression, which
compresses
the gradients sent
both
to and from the workers.
%the parameter server aggregates the workers' compressed gradients and transmits a compressed merged result back to the workers.
We provide a convergence analysis on nonconvex problems for
general gradient compressors.
By partitioning
the gradient
into blocks,
we introduce a blockwise compressor in which
each gradient block is compressed and transmitted in 1-bit format with a scaling
factor, leading to a nearly $32\times$ reduction in communication.
Experimental results show that the proposed method converges as fast as
full-precision distributed momentum SGD and achieves the same testing accuracy. In
particular,
on distributed ResNet training with 7 workers
on ImageNet,
the proposed algorithm achieves the same testing accuracy as
momentum SGD using full-precision gradients, but with $46\%$ less wall-clock time.
\end{abstract}
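
% Illustrative sketch (not part of the abstract): the scaled-sign form and the
% symbols v_b, d_b, and \mathcal{C} below are assumptions introduced for illustration.
As a rough illustration of why a blockwise 1-bit compressor with a scaling
factor gives a \emph{nearly} $32\times$ reduction, consider one possible
instance, assumed here rather than taken from the abstract: a scaled-sign
compressor that replaces a gradient block $v_b$ with $d_b$ entries by its
signs times its mean absolute value,
\[
  \mathcal{C}(v_b) \;=\; \frac{\|v_b\|_1}{d_b}\,\mathrm{sign}(v_b).
\]
Assuming gradients are stored as 32-bit floats, the uncompressed block costs
$32\,d_b$ bits, while the compressed block costs $d_b$ sign bits plus one
32-bit scaling factor. The resulting compression ratio is
\[
  \frac{32\,d_b}{d_b + 32} \;=\; 32 - \frac{1024}{d_b + 32},
\]
which approaches $32$ as the block size grows. For example, a block with
$d_b = 992$ entries would give a ratio of exactly $31$.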
