abstract:39db3eb2d9719400.tex

1: \begin{abstract}

2:   Distributed training of massive machine learning models, in particular deep neural networks, via Stochastic Gradient Descent (SGD) is becoming commonplace.

3:   %

4:   Several families of communication-reduction methods, such as quantization, large-batch methods, and gradient sparsification, have been proposed.

5:   %

6:   To date, gradient sparsification methods, in which each node sorts its gradient components by magnitude, communicates only a subset of them, and accumulates the rest locally, are known to yield some of the largest practical gains.

7:   %

8:   Such methods can reduce the amount of communication per step by up to \emph{three orders of magnitude}, while preserving model accuracy.

9:   %

10:   Yet, this family of methods currently has no theoretical justification.

11:

12: This is the gap we address in this paper.

13: We prove that, under analytic assumptions, sparsifying gradients by magnitude with local error correction provides convergence guarantees for data-parallel SGD on both convex and non-convex smooth objectives.

14: The main insight is that sparsification methods implicitly maintain bounds on the maximum impact of stale updates, thanks to selection by magnitude.

15: Our analysis and empirical validation also reveal that these methods require analytic conditions to converge well, which justifies existing heuristics.

16: \end{abstract}

17: