\begin{abstract}
Mini-batch stochastic gradient descent (SGD) is the state of the art in large-scale distributed training.
The scheme can reach a linear speedup with respect to the number of workers, but
this is rarely seen in practice, as the scheme often suffers from large network delays and bandwidth limits.
To overcome this communication bottleneck, recent works propose reducing the communication frequency. One algorithm of this type is \emph{local SGD}, which runs SGD independently in parallel on different workers and averages the sequences only once in a while.
This scheme shows promising results in practice, but has so far eluded thorough theoretical analysis.

We prove concise convergence rates for local SGD on convex problems and show that it
converges at the same rate as mini-batch SGD in terms of the number of evaluated gradients; that is, the scheme achieves linear speedup in the number of workers and the mini-batch size.
Compared to mini-batch SGD, the number of communication rounds can be reduced by up to a factor of $T^{1/2}$, where $T$ denotes the total number of steps. This also holds for asynchronous implementations.

Local SGD can also be used for large-scale training of deep learning models. The results shown here aim to serve as a guideline for further exploring the theoretical and practical aspects of local SGD in these applications.
\end{abstract}
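
For concreteness, the local SGD scheme described in the abstract can be sketched as a single update rule. The notation is an assumption introduced only for this illustration: $K$ workers, local iterates $x_t^k$ on worker $k$, stepsizes $\eta_t$, stochastic gradients $g_t^k$ evaluated at $x_t^k$, and a set $\mathcal{I} \subseteq \{1,\dots,T\}$ of synchronization steps.
\[
x_{t+1}^k =
\left\{
\begin{array}{ll}
x_t^k - \eta_t\, g_t^k, & \mbox{if } t+1 \notin \mathcal{I}, \\[4pt]
\frac{1}{K} \sum_{j=1}^{K} \left( x_t^j - \eta_t\, g_t^j \right), & \mbox{if } t+1 \in \mathcal{I}.
\end{array}
\right.
\]
Under this sketch, choosing $\mathcal{I} = \{1,\dots,T\}$ (averaging after every step) keeps all workers at the same iterate and so corresponds to mini-batch SGD, whereas a sparser $\mathcal{I}$ means fewer communication rounds.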