
\begin{abstract}

Error Feedback (\algname{EF}) is a highly popular and immensely effective mechanism for fixing convergence issues that arise in distributed training methods (such as distributed \algname{GD} or \algname{SGD}) when these are enhanced with greedy communication compression techniques such as TopK. While \algname{EF} was proposed almost a decade ago~\citep{Seide2014}, and despite concentrated effort by the community to advance the theoretical understanding of this mechanism, there is still a lot to explore. In this work we study a modern form of error feedback called \algname{EF21}~\citep{EF21}, which offers the currently best-known theoretical guarantees under the weakest assumptions, and also works well in practice. In particular, while the theoretical communication complexity of \algname{EF21} depends on the {\em quadratic mean} of certain smoothness parameters, we improve this dependence to their {\em arithmetic mean}, which is never larger, and can be substantially smaller, especially in heterogeneous data regimes. We take the reader on a journey of our discovery process. Starting with the idea of applying \algname{EF21} to an equivalent reformulation of the underlying problem which (unfortunately) requires (often impractical) machine {\em cloning}, we continue to the discovery of a new {\em weighted} version of \algname{EF21} which can (fortunately) be executed without any cloning, and finally circle back to an improved {\em analysis} of the original \algname{EF21} method. While this development applies to the simplest form of \algname{EF21}, our approach naturally extends to more elaborate variants involving stochastic gradients and partial participation. Further, our technique improves the best-known theory of \algname{EF21} in the {\em rare features} regime~\citep{EF21-RF}. Finally, we validate our theoretical findings with suitable experiments.

\end{abstract}
