
\begin{abstract}

Mini-batch stochastic gradient descent (SGD) is the state of the art in large-scale distributed training.

The scheme can reach a linear speedup with respect to the number of workers, but this is rarely seen in practice, as the scheme often suffers from large network delays and bandwidth limits.

To overcome this communication bottleneck, recent works propose to reduce the communication frequency. An algorithm of this type is \emph{local SGD}, which runs SGD independently in parallel on different workers and averages the sequences only once in a while.

This scheme shows promising results in practice, but has so far eluded thorough theoretical analysis.
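
As a minimal sketch of the local SGD scheme, under notation assumed here for illustration only: let $K$ be the number of workers, $x_t^k$ the iterate of worker $k$ at step $t$, $\eta_t$ a stepsize, $g_t^k$ a stochastic gradient evaluated by worker $k$ at $x_t^k$, and $\mathcal{I}_T \subseteq \{1,\dots,T\}$ a set of synchronization steps. One way to write the update is
\[
x_{t+1}^k = \left\{
\begin{array}{ll}
x_t^k - \eta_t g_t^k, & \mbox{if } t+1 \notin \mathcal{I}_T,\\[2pt]
\frac{1}{K}\sum_{j=1}^{K}\bigl(x_t^j - \eta_t g_t^j\bigr), & \mbox{if } t+1 \in \mathcal{I}_T.
\end{array}
\right.
\]
If every step is a synchronization step, this reduces to mini-batch SGD with the batch spread over the $K$ workers. If the workers synchronize only every $H$ steps, where $H$ is again an assumed symbol, the scheme needs about $T/H$ communication rounds instead of $T$.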


We prove concise convergence rates for local SGD on convex problems and show that it converges at the same rate as mini-batch SGD in terms of the number of evaluated gradients; that is, the scheme achieves linear speedup in the number of workers and the mini-batch size.

The number of communication rounds can be reduced by up to a factor of $T^{1/2}$, where $T$ denotes the total number of steps, compared to mini-batch SGD. This also holds for asynchronous implementations.


Local SGD can also be used for large-scale training of deep learning models. The results shown here aim to serve as a guideline for further exploring the theoretical and practical aspects of local SGD in these applications.

\end{abstract}
