5bfffe4d95dd0011.tex
1: \begin{abstract}
2: %Advances in Deep Neural Networks prompt %(propel) 
3: %the use of large models with more parameters, which in turn presents 
4: Large neural network models pose a substantial communication challenge to distributed 
5: Stochastic Gradient Descent (SGD), with a communication complexity 
6: of $\Or(n)$ per worker for a model of $n$ parameters. % trained by $n$ 
7: %workers. 
8: Many sparsification and quantization techniques have been proposed
9: %to cut down the communication traffic in terms of volume and/or frequency.
10: to compress the gradients, 
11: %reducing the complexity down to $\Or(n/k)$, where k may reach 1,000 or 
12: %10,000. 
13: some reducing the communication complexity to $\Or(k)$, where $k \ll 
14: n$. 
15: In this paper, we introduce a strategy called two-level gradient averaging (A2SGD) 
16: that consolidates all gradients into just two local averages per worker before
17: computing two global averages to update the model. 
18: A2SGD also retains local errors to maintain the variance for fast convergence.
19: Our theoretical analysis shows that A2SGD converges similarly to the default distributed SGD algorithm. 
20: %Our theoretical analysis shows that A2SGD achieves a convergence rate faster than Top-K.
21: Our evaluation validates the theoretical conclusion and 
22: demonstrates that A2SGD significantly reduces the communication traffic
23: per worker, and improves the overall training time of LSTM-PTB by $3.2\times$ and 
24: $23.2\times$, respectively, compared to Top-K and QSGD. 
25: %Our analysis also shows that A2SGD achieves the lowest computation complexity asymptotically, 
26: %compared to these sparsification and quantization techniques.
27: To the best of our knowledge, A2SGD is the first to achieve $\Or(1)$
28: communication complexity per worker for distributed SGD. 
29:    
30: \end{abstract}
31: