1: \begin{abstract}
2: % Why we do?
3: The state-of-the-art deep learning algorithms rely on distributed training systems to tackle the increasing sizes of models and training data sets.
4: Minibatch stochastic gradient descent (SGD) algorithm requires workers to halt forward/back propagations, to wait for gradients aggregated from all workers, and to receive weight updates before the next batch of tasks.
5: This synchronous execution model exposes the overheads of gradient/weight communication among the large number of workers a distributed training system.
6: % How we do?
7: We propose a new SGD algorithm, DaSGD (Local SGD with Delayed Averaging), which parallelizes SGD and forward/back propagations to hide $100\%$ of the communication overhead.
8: By adjusting the gradient update scheme, this algorithm uses hardware resources more efficiently and reduces the reliance on the low-latency and high-throughput inter-connects.
9: The theoretical analysis and the experimental results show its convergence rate $ O (1 / \sqrt {K} )$, the same as SGD.
10: The performance evaluation demonstrates it enables a linear performance scale-up with the cluster size.
11:
12: \end{abstract}
13: