
\begin{abstract}

%\vspace*{-0.15in}

Distributed Deep Learning (DDL) is essential for large-scale Deep Learning (DL) training. \weiz{Synchronous Stochastic Gradient Descent (SSGD)\footnote{In the literature, SSGD is also called ``Centralized Synchronized Stochastic Gradient Descent''. In this paper, we use these two terms interchangeably.} is the de facto DDL optimization method. Using a sufficiently large batch size is critical to achieving DDL runtime speedup. In a large-batch setting, the learning rate must be increased to compensate for the reduced number of parameter updates. However, a large learning rate may harm convergence in SSGD, and training can easily diverge.} \weiz{Recently, Decentralized Parallel SGD (DPSGD) has been proposed to improve distributed training speed.

% While there is anecdotal evidence that DPSGD outperforms SSGD in the large-batch setting, no systematic study has been conducted to explain why this is the case.

In this paper, we find that DPSGD offers not only a system-level runtime benefit but also a significant convergence benefit over SSGD in the large-batch setting.}

Based on a detailed analysis of the DPSGD learning dynamics, we find that DPSGD introduces additional landscape-dependent noise that automatically adjusts the effective learning rate to improve convergence. In addition, we theoretically show that this noise smooths the loss landscape, hence allowing a larger learning rate. We conduct extensive studies over 18 state-of-the-art DL models/tasks and demonstrate that DPSGD often converges in cases where SSGD diverges at large learning rates in the large-batch setting. Our findings are consistent across two different application domains, Computer Vision (CIFAR10 and ImageNet-1K) and Automatic Speech Recognition (SWB300 and SWB2000), and two different types of neural network models, Convolutional Neural Networks and Long Short-Term Memory Recurrent Neural Networks.

\end{abstract}
