
\begin{abstract}

  %% Designing an efficient large-scale distributed learning strategy that has a good convergence behavior and low communication cost is not only theoretically appealing but also practically challenging.

  Decentralized Parallel SGD (D-PSGD) and its asynchronous variant, Asynchronous Decentralized Parallel SGD (\adpsgd), form a family of distributed learning algorithms that have been demonstrated to perform well for large-scale deep learning tasks. One drawback of \gdpsgd is that the \sg of the mixing matrix decreases as the number of learners in the system increases, which hampers convergence. In this paper, we investigate techniques to accelerate \gdpsgd-based training by improving the \sg while minimizing the communication cost. %% Specifically, we devise a randomized local averaging scheme to improve the spectral gap of the doubly stochastic mixing matrix in D-PSGD, which leads to faster convergence than that of fixed local averaging. We also investigate a delay-by-one scheme which allows large batch sizes for speedup.

  We demonstrate the effectiveness of our proposed techniques by running experiments on the 2000-hour Switchboard speech recognition task and the ImageNet computer vision task. %% We show that the investigated strategies can significantly improve the training efficiency in terms of convergence and communication cost.

  On an IBM P9 supercomputer, our system trains an LSTM acoustic model in 2.28 hours using 64 V100 GPUs, reaching a word error rate (WER) of 7.5\% on the Hub5-2000 Switchboard (SWB) test set and 13.3\% on the CallHome (CH) test set. Using 128 V100 GPUs, training takes 1.98 hours, with 7.7\% WER on SWB and 13.3\% WER on CH, which is the fastest training time reported to date.

\end{abstract}
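
% Illustrative sketch only: the ring topology and the uniform weights 1/3 are
% assumptions made for this example and are not taken from the abstract.
To make the dependence of the spectral gap on the number of learners concrete, consider a minimal worked example under assumptions that the abstract does not state: a ring of $n$ learners in which each learner averages its model with those of its two neighbors using uniform weights $1/3$. The resulting mixing matrix $W$ is symmetric, doubly stochastic, and circulant, with eigenvalues
\[
  \lambda_k = \tfrac{1}{3}\bigl(1 + 2\cos(2\pi k/n)\bigr), \qquad k = 0, 1, \dots, n-1 .
\]
Taking the spectral gap to be $1 - \max_{k \neq 0} |\lambda_k|$, for $n \geq 4$ this gives
\[
  1 - \lambda_1 = \tfrac{2}{3}\bigl(1 - \cos(2\pi/n)\bigr) \approx \frac{4\pi^2}{3n^2} \quad \mbox{for large } n ,
\]
so in this hypothetical setting the gap shrinks roughly quadratically in $n$, and doubling the number of learners reduces it by about a factor of four.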