abstract:d0631e1b296922c0.tex

1: \begin{abstract}

2: Modern deep learning models are often trained in parallel over a collection of distributed machines to reduce training time. In such settings, communication of model updates among machines becomes a significant performance bottleneck, and various lossy update-compression techniques have been proposed to alleviate this problem. In this work, we introduce a new, simple yet theoretically and practically effective compression technique: {\em natural compression ($\NC$)}. Our technique is applied individually to all entries of the to-be-compressed update vector and works by randomized rounding to the nearest (negative or positive) power of two, which can be computed in a ``natural'' way by ignoring the mantissa. We show that, compared to no compression, $\NC$ increases the second moment of the compressed vector by no more than the tiny factor $\nicefrac{9}{8}$, which means that the effect of $\NC$ on the convergence speed of popular training algorithms, such as distributed SGD, is negligible. However, the communication savings enabled by $\NC$ are substantial,

3: %($3.56 \times$ for {\em binary32} and $5.82\times$ for {\em binary64}),

4: leading to a {\em $3$--$4\times$ improvement in overall theoretical running time}. For applications requiring more aggressive compression, we generalize $\NC$ to {\em natural dithering}, which we prove is {\em exponentially better} than the common random dithering technique. Our compression operators can be used on their own or in combination with existing operators for a more aggressive combined effect, and they offer a new state of the art both in theory and in practice.%  Finally, we show that besides generic usage in distributed training, $\NC$ is effective for the in-network aggregation (INA)  on a switch, which can only perform integer computations.

5:

6: \end{abstract}
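
The rounding rule summarized in the abstract can be sketched as follows. This is an illustrative reading and not necessarily the exact formulation used later in the paper: the symbols $t$, $k$, $p$ and $x$ are notation assumed here, and the rounding probabilities are assumed to be the ones that make the operator unbiased. For a nonzero scalar $t$ with $2^{k} \le |t| < 2^{k+1}$ for an integer $k$, one may write
\[
\NC(t) = \left\{
\begin{array}{ll}
\mathrm{sign}(t)\, 2^{k} & \mbox{with probability } p = \frac{2^{k+1} - |t|}{2^{k}}, \\[2pt]
\mathrm{sign}(t)\, 2^{k+1} & \mbox{with probability } 1 - p,
\end{array}
\right.
\]
together with $\NC(0) = 0$. Under this choice, $\mathrm{E}[\NC(t)] = \mathrm{sign}(t)\, 2^{k} (2 - p) = t$, so the operator is unbiased. Writing $x = |t| / 2^{k} \in [1, 2)$, so that $p = 2 - x$, the second moment satisfies
\[
\frac{\mathrm{E}\left[\NC(t)^2\right]}{t^2}
= \frac{2^{2k} \left( p + 4 (1 - p) \right)}{t^2}
= \frac{3x - 2}{x^2}
\le \frac{9}{8},
\]
where the maximum over $x \in [1, 2)$ is attained at $x = \nicefrac{4}{3}$. This is consistent with the factor $\nicefrac{9}{8}$ quoted in the abstract. As a concrete instance, $t = 3$ gives $k = 1$ and $p = \nicefrac{1}{2}$, so $\NC(3)$ equals $2$ or $4$ with equal probability, with mean $3$ and second moment $10 \le \nicefrac{9}{8} \cdot 9$.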

7: