abstract:5bfffe4d95dd0011.tex

\begin{abstract}
%Advances in Deep Neural Networks prompt %(propel)
%the use of large models with more parameters, which in turn presents
Large neural network models present a significant communication challenge to distributed
Stochastic Gradient Descent (SGD), with a communication complexity
of $\Or(n)$ per worker for a model of $n$ parameters. % trained by $n$
%workers.
Many sparsification and quantization techniques have been proposed
%to cut down the communication traffic in terms of volume and/or frequency.
to compress the gradients,
%reducing the complexity down to $\Or(n/k)$, where k may reach 1,000 or
%10,000.
some reducing the communication complexity to $\Or(k)$, where $k \ll n$.
In this paper, we introduce a strategy called two-level gradient averaging (A2SGD)
that consolidates all gradients into merely two local averages per worker before
computing two global averages for the model update.
A2SGD also retains local errors to maintain the variance for fast convergence.
Our theoretical analysis shows that A2SGD converges similarly to the default distributed SGD algorithm.
%Our theoretical analysis shows that A2SGD achieves a convergence rate faster than Top-K.
Our evaluation validates the theoretical conclusion and
demonstrates that A2SGD significantly reduces the communication traffic
per worker and improves the overall training time of LSTM-PTB by $3.2\times$ and
$23.2\times$ compared to Top-K and QSGD, respectively.
%Our analysis also shows that A2SGD achieves the lowest computation complexity asymptotically,
%compared to these sparsification and quantization techniques.
To the best of our knowledge, A2SGD is the first to achieve $\Or(1)$
communication complexity per worker for distributed SGD.
\end{abstract}