abstract:09280b5390e47e84.tex

1: \begin{abstract}

2: 	% Why we do？

3: 	The state-of-the-art deep learning algorithms rely on distributed training systems to tackle the increasing sizes of models and training data sets.

4:     Minibatch stochastic gradient descent (SGD) algorithm requires workers to halt forward/back propagations, to wait for gradients aggregated from all workers, and to receive weight updates before the next batch of tasks.

5:     This synchronous execution model exposes the overheads of gradient/weight communication among the large number of workers a distributed training system.

6: 	% How we do?

7:      We propose a new SGD algorithm, DaSGD (Local SGD with Delayed Averaging), which parallelizes SGD and forward/back propagations to hide $100\%$ of the communication overhead.

8: 	By adjusting the gradient update scheme, this algorithm uses hardware resources more efficiently and reduces the reliance on the low-latency and high-throughput inter-connects.

9: 	The theoretical analysis and the experimental results show its convergence rate $ O (1 / \sqrt {K} )$, the same as SGD.

10: 	The performance evaluation demonstrates it enables a linear performance scale-up with the cluster size.

11:

12: \end{abstract}

13: