\begin{abstract}
The communication bottleneck has been identified as a significant issue in
distributed optimization of large-scale learning models. Recently,
several approaches to mitigate this problem have been proposed,
including different forms of gradient compression and computing
local models that are mixed iteratively. In this paper, we propose the
\emph{Qsparse-local-SGD} algorithm, which combines aggressive
sparsification with quantization and local computation, together with
error compensation that keeps track of the difference between the
true and compressed gradients. We propose both synchronous and
asynchronous implementations of \emph{Qsparse-local-SGD}. We analyze
the convergence of \emph{Qsparse-local-SGD} in the \emph{distributed} setting for
smooth non-convex and convex objective functions. We demonstrate that
\emph{Qsparse-local-SGD} converges at the same rate as vanilla
distributed SGD for many important classes of sparsifiers and
quantizers. We use \emph{Qsparse-local-SGD} to train ResNet-50 on ImageNet and
show that it yields significant savings over the state of the art in the
number of bits transmitted to reach a target accuracy.
\end{abstract}
