1: \begin{abstract}
2: Distributed parallel stochastic gradient descent algorithms are workhorses for large-scale machine learning tasks.
3: Among them, local stochastic gradient descent (Local SGD) has attracted significant attention due to its low communication complexity.
4: Previous studies prove that the communication complexity of Local SGD with a fixed or an adaptive communication period is on the order of $O (N^{\frac{3}{2}} T^{\frac{1}{2}})$ when the data distributions on clients are identical (IID) and $O (N^{\frac{3}{4}} T^{\frac{3}{4}})$ otherwise (Non-IID), where $N$ is the number of clients and $T$ is the number of iterations.
5: In this paper, to accelerate convergence by reducing the communication complexity,
6: we propose \textit{ST}agewise \textit{L}ocal \textit{SGD} (STL-SGD), which gradually increases the communication period as the learning rate decreases.
7: We prove that STL-SGD retains the same convergence rate and linear speedup as mini-batch SGD.
8: In addition, as a benefit of the increasing communication period, when the objective is strongly convex or satisfies the Polyak-\L ojasiewicz condition, the communication complexity of STL-SGD is $O (N \log{T})$ for the IID case and $O (N^{\frac{1}{2}} T^{\frac{1}{2}})$ for the Non-IID case, a significant improvement over Local SGD.
9: Experiments on both convex and non-convex problems demonstrate the superior performance of STL-SGD.
10: \end{abstract}
11: