\begin{abstract}

Data parallelism has become a dominant method to scale Deep Neural Network (DNN) training across multiple nodes.

Since synchronizing the large number of gradients of each local model can be a bottleneck in large-scale distributed training,
compressing communication data has recently gained widespread attention.

Among several recently proposed compression algorithms,
Residual Gradient Compression (RGC) is one of the most successful approaches: it can significantly reduce the size of the message each node transmits (to 0.1\% of the gradient size) while preserving accuracy and convergence speed.

However, the literature on compressing deep networks focuses almost exclusively on achieving a good theoretical compression rate, while the efficiency of RGC in real distributed implementations has received less attention.

In this paper, we develop an RGC-based system that reduces the end-to-end training time on real-world multi-GPU systems.

Our proposed design, called RedSync,
introduces a set of optimizations that reduce the communication bandwidth requirement while adding only limited overhead.

We evaluate the performance of RedSync on two multi-GPU platforms: 128 GPUs of a supercomputer and an 8-GPU server.

Our test cases include image classification tasks on CIFAR-10 and ImageNet, and language modeling tasks on the Penn Treebank and Wiki2 datasets.

For DNNs with a high communication-to-computation ratio, which have long been considered to scale poorly, RedSync brings significant performance improvements.

\end{abstract}
