09280b5390e47e84.tex
1: \begin{abstract}
2: 	% Why we do?
3: 	The state-of-the-art deep learning algorithms rely on distributed training systems to tackle the increasing sizes of models and training data sets.
4:     Minibatch stochastic gradient descent (SGD) algorithm requires workers to halt forward/back propagations, to wait for gradients aggregated from all workers, and to receive weight updates before the next batch of tasks.
5:     This synchronous execution model exposes the overheads of gradient/weight communication among the large number of workers a distributed training system.
6: 	% How we do?
7:      We propose a new SGD algorithm, DaSGD (Local SGD with Delayed Averaging), which parallelizes SGD and forward/back propagations to hide $100\%$ of the communication overhead.
8: 	By adjusting the gradient update scheme, this algorithm uses hardware resources more efficiently and reduces the reliance on the low-latency and high-throughput inter-connects. 
9: 	The theoretical analysis and the experimental results show its convergence rate $ O (1 / \sqrt {K} )$, the same as SGD. 
10: 	The performance evaluation demonstrates it enables a linear performance scale-up with the cluster size.
11: 
12: \end{abstract}
13: