\begin{abstract}
Training large machine learning models requires a distributed computing approach, with communication of the model updates being the bottleneck. For this reason, several methods based on the compression (e.g., sparsification and/or quantization) of updates were recently proposed, including {\tt QSGD} \cite{alistarh2017qsgd}, {\tt TernGrad} \cite{wen2017terngrad}, {\tt SignSGD} \cite{pmlr-v80-bernstein18a}, and {\tt DQGD} \cite{khirirat2018distributed}. However, none of these methods is able to learn the gradients, which renders them incapable of converging to the true optimum in the batch mode, makes them incompatible with non-smooth regularizers, and slows down their convergence. In this work we propose a new distributed learning method, {\tt DIANA}, which resolves these issues via compression of {\em gradient differences}. We perform a theoretical analysis in the strongly convex and nonconvex settings and show that our rates are superior to existing rates. Our analysis of block-quantization and of the differences between $\ell_2$ and $\ell_{\infty}$ quantization closes the gaps in theory and practice. Finally, by applying our analysis technique to {\tt TernGrad}, we establish the first convergence rate for this method.
\end{abstract}