\begin{abstract}
%Advances in Deep Neural Networks prompt %(propel)
%the use of large models with more parameters, which in turn presents
Large neural network models present a substantial communication challenge to distributed
Stochastic Gradient Descent (SGD), with a communication complexity
of $\Or(n)$ per worker for a model of $n$ parameters. % trained by $n$
%workers.
Many sparsification and quantization techniques have been proposed
%to cut down the communication traffic in terms of volume and/or frequency.
to compress the gradients,
%reducing the complexity down to $\Or(n/k)$, where k may reach 1,000 or
%10,000.
some reducing the communication complexity to $\Or(k)$, where $k \ll
n$.
In this paper, we introduce a strategy called two-level gradient averaging (A2SGD)
that consolidates all gradients down to merely two local averages per worker before
the computation of two global averages for an updated model.
A2SGD also retains local errors to maintain the variance for fast convergence.
Our theoretical analysis shows that A2SGD converges similarly to the default distributed SGD algorithm.
%Our theoretical analysis shows that A2SGD achieves a convergence rate faster than Top-K.
Our evaluation validates the theoretical conclusion and
demonstrates that A2SGD significantly reduces the communication traffic
per worker and speeds up the overall training of LSTM-PTB by $3.2\times$ and
$23.2\times$ compared to Top-K and QSGD, respectively.
%Our analysis also shows that A2SGD achieves the lowest computation complexity asymptotically,
%compared to these sparsification and quantization techniques.
To the best of our knowledge, A2SGD is the first to achieve $\Or(1)$
communication complexity per worker for distributed SGD.

\end{abstract}
