\begin{abstract}
    Distributed parallel stochastic gradient descent algorithms are workhorses for large-scale machine learning tasks.
    Among them, local stochastic gradient descent (Local SGD) has attracted significant attention due to its low communication complexity.
    Previous studies prove that the communication complexity of Local SGD with a fixed or an adaptive communication period is of order $O (N^{\frac{3}{2}} T^{\frac{1}{2}})$ when the data distributions on clients are identical (IID) and $O (N^{\frac{3}{4}} T^{\frac{3}{4}})$ otherwise (Non-IID), where $N$ is the number of clients and $T$ is the number of iterations.
    In this paper, to accelerate convergence by reducing the communication complexity,
    we propose \textit{ST}agewise \textit{L}ocal \textit{SGD} (STL-SGD), which gradually increases the communication period as the learning rate decreases.
    We prove that STL-SGD retains the same convergence rate and linear speedup as mini-batch SGD.
    In addition, as a benefit of increasing the communication period, when the objective is strongly convex or satisfies the Polyak-\L ojasiewicz condition, the communication complexity of STL-SGD is $O (N \log{T})$ in the IID case and $O (N^{\frac{1}{2}} T^{\frac{1}{2}})$ in the Non-IID case, a significant improvement over Local SGD.
    Experiments on both convex and non-convex problems demonstrate the superior performance of STL-SGD.
\end{abstract}
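
To give some intuition for why a growing communication period can reduce the number of communication rounds to logarithmic in $T$, consider the following back-of-the-envelope sketch. It is an illustration under assumed schedules, not the schedule or analysis of STL-SGD itself: the stage index $s$, the stage lengths $T_s$, the communication periods $k_s$, and the doubling rule are all hypothetical choices made here for the arithmetic. Suppose training runs in stages $s = 1, \dots, S$, and that both the stage length and the communication period double from one stage to the next:
\[
  T_s = 2^{s-1} T_1, \qquad k_s = 2^{s-1} k_1 .
\]
Each stage then uses the same number of communication rounds, $T_s / k_s = T_1 / k_1$, while the total number of iterations is
\[
  T = \sum_{s=1}^{S} T_s = (2^{S} - 1)\, T_1
  \quad\Longrightarrow\quad
  S = \log_2\!\left( \frac{T}{T_1} + 1 \right).
\]
The total number of communication rounds is therefore
\[
  \sum_{s=1}^{S} \frac{T_s}{k_s} = \frac{T_1}{k_1} \log_2\!\left( \frac{T}{T_1} + 1 \right),
\]
which grows logarithmically in $T$ when $T_1 / k_1$ is held fixed, whereas a fixed period $k$ requires $T / k$ rounds. The dependence on the number of clients $N$ enters through the choice of $T_1$ and $k_1$, which this sketch does not derive.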