\begin{abstract}
Distributed training of massive machine learning models, in particular deep neural networks, via Stochastic Gradient Descent (SGD) is becoming commonplace.
%
Several families of communication-reduction methods, such as quantization, large-batch methods, and gradient sparsification, have been proposed.
%
To date, gradient sparsification methods, in which each node sorts gradients by magnitude and communicates only a subset of the components while accumulating the rest locally, are known to yield some of the largest practical gains.
%
Such methods can reduce the amount of communication per step by up to \emph{three orders of magnitude}, while preserving model accuracy.
%
Yet, this family of methods currently has no theoretical justification.

This is the gap we address in this paper.
We prove that, under analytic assumptions, data-parallel SGD that sparsifies gradients by magnitude with local error correction has convergence guarantees for both convex and non-convex smooth objectives.
The main insight is that sparsification methods implicitly maintain bounds on the maximum impact of stale updates, thanks to selection by magnitude.
Our analysis and empirical validation also reveal that these methods do require analytic conditions to converge well, justifying existing heuristics.
\end{abstract}
17: