
\begin{abstract}
The communication bottleneck has been identified as a significant issue in
distributed optimization of large-scale learning models. Recently,
several approaches to mitigate this problem have been proposed,
including different forms of gradient compression, and computing
local models and mixing them iteratively. In this paper, we propose the
\emph{Qsparse-local-SGD} algorithm, which combines aggressive
sparsification with quantization and local computation, along with
error compensation that keeps track of the difference between the
true and compressed gradients. We propose both synchronous and
asynchronous implementations of \emph{Qsparse-local-SGD}. We analyze
the convergence of \emph{Qsparse-local-SGD} in the \emph{distributed} setting for
smooth non-convex and convex objective functions. We demonstrate that
\emph{Qsparse-local-SGD} converges at the same rate as vanilla
distributed SGD for many important classes of sparsifiers and
quantizers. We use \emph{Qsparse-local-SGD} to train ResNet-50 on ImageNet and
show that it yields significant savings over the state of the art in the
number of bits transmitted to reach a target accuracy.
\end{abstract}
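
To make the error-compensation idea concrete, the following is a minimal sketch of a single compressed update on one worker. The notation is assumed here for illustration and is not taken from the abstract: $g_t$ denotes the stochastic gradient, $\eta$ the learning rate, $m_t$ the locally stored compression error (with $m_0 = 0$), and $\mathcal{C}$ a generic compression operator standing in for a composed sparsifier and quantizer.
\[
  p_t = m_t + \eta\, g_t, \qquad
  \Delta_t = \mathcal{C}(p_t), \qquad
  m_{t+1} = p_t - \Delta_t .
\]
Under these assumptions only $\Delta_t$ is transmitted, and the residual $m_{t+1}$ is added back before the next compression, so whatever is dropped in one round is delayed rather than discarded. With local computation, one would expect $\eta\, g_t$ to be replaced by the net change accumulated over several local SGD steps between synchronizations.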
