
\begin{abstract}%

Stochastic gradient descent (SGD) is a prevalent optimization technique for large-scale distributed machine learning.

While SGD computation can be efficiently divided between multiple machines, communication typically becomes a bottleneck in the distributed setting.

Gradient compression methods can be used to alleviate this problem, and a recent line of work shows that SGD augmented with gradient compression converges to an $\eps$-first-order stationary point. %\todo{add citations}

In this paper, we extend these results to convergence to an $\eps$-\textit{second}-order stationary point ($\eps$-SOSP), which is, to the best of our knowledge, the first result of this type.

In addition, we show that, when the stochastic gradient is not Lipschitz, compressed SGD with the \randomk compressor converges to an $\eps$-SOSP in the same number of iterations as uncompressed SGD~\citep{jin2019nonconvex} (JACM), while improving the total communication by a factor of $\tilde \Theta(\sqrt{d} \eps^{-\nicefrac 34})$, where $d$ is the dimension of the optimization problem.

We present additional results for the cases when the compressor is arbitrary and when the stochastic gradient is Lipschitz.

% Furthermore, under a standard assumption that the stochastic gradient is Lipschitz, the total communication decreases by $\tilde \Theta(\eps^{-\nicefrac 34})$.

\end{abstract}
