1: \begin{abstract}
2: Data parallelism has become a dominant method to scale Deep Neural Network (DNN) training across multiple nodes.
3: Since synchronizing a large number of gradients of the local model can be a bottleneck for large-scale distributed training,
4: compressing communication data has gained widespread attention recently.
5: Among several recent proposed compression algorithms,
6: Residual Gradient Compression (RGC) is one of the most successful approaches---it can significantly compress the transmitting message size (0.1\% of the gradient size) of each node and still achieve correct accuracy and the same convergence speed.
7: However, the literature on compressing deep networks focuses almost exclusively on achieving good theoretical compression rate, while the efficiency of RGC in real distributed implementation has been less investigated.
8: In this paper, we develop an RGC-based system that is able to reduce the end-to-end training time on real-world multi-GPU systems.
9: Our proposed design called RedSync,
10: which introduces a set of optimizations to reduce communication bandwidth requirement while introducing limited overhead.
11: We evaluate the performance of RedSync on two different multiple GPU platforms, including 128 GPUs of a supercomputer and an 8-GPU server.
12: Our test cases include image classification tasks on Cifar10 and ImageNet, and language modeling tasks on Penn Treebank and Wiki2 datasets.
13: For DNNs featured with high communication to computation ratio, which have long been considered with poor scalability, RedSync brings significant performance improvements.
14: \end{abstract}
15: