\begin{abstract}
  Distributed training of massive machine learning models, in particular deep neural networks, via Stochastic Gradient Descent (SGD) is becoming commonplace.
  %
  Several families of communication-reduction methods, such as quantization, large-batch methods, and gradient sparsification, have been proposed.
  %
  To date, gradient sparsification methods, in which each node sorts its gradient components by magnitude and communicates only a subset of them while accumulating the rest locally, are known to yield some of the largest practical gains.
  %
  Such methods can reduce the amount of communication per step by up to \emph{three orders of magnitude}, while preserving model accuracy.
  %
  Yet, this family of methods currently has no theoretical justification.

This is the gap we address in this paper.
We prove that, under analytic assumptions, sparsifying gradients by magnitude with local error correction provides convergence guarantees for data-parallel SGD, for both convex and non-convex smooth objectives.
The main insight is that, thanks to selection by magnitude, sparsification methods implicitly maintain bounds on the maximum impact of stale updates.
Our analysis and empirical validation also reveal that these methods do require analytic conditions to converge well, justifying existing heuristics.
\end{abstract}