\begin{abstract}
%-------------------------------------------------------------------------------
  Stochastic gradient descent (SGD) is an inherently sequential
  training algorithm: computing the gradient at batch $i$ depends on
  the model parameters learned from batch $i-1$.  Prior approaches
  that break this dependence do not honor it (e.g., they sum the
  gradients of the batches, which is not what sequential SGD would do)
  and thus potentially suffer from poor convergence. This paper
  introduces a novel method to \emph{combine} gradients called
  \adasum~(for adaptive sum) that converges faster than prior work.
  \adasum{} is easy to implement, almost as efficient as simply summing
  gradients, and is integrated into the open-source toolkit Horovod.

  This paper first provides a formal justification for \adasum{} and
  then empirically demonstrates that \adasum{} is more accurate than
  prior gradient accumulation methods.  It then introduces a series of
  case studies showing that \adasum{} works with multiple frameworks
  (\tf{} and \torch) and scales multiple optimizers (Momentum-SGD,
  Adam, and LAMB) to larger batch sizes while still giving good
  downstream accuracy. Finally, it proves that \adasum{} converges.

  To summarize, \adasum{} scales Momentum-SGD on the MLPerf ResNet-50
  benchmark to 64K examples before communication (no MLPerf entry
  converged with more than 16K), the Adam optimizer to 64K examples
  before communication on BERT-LARGE (prior work showed Adam stopped
  scaling at 16K), and the LAMB optimizer to 128K examples before
  communication on BERT-LARGE (prior work used 64K), all while
  maintaining downstream accuracy metrics.  Moreover, for users who do
  not need to scale, we show that LAMB with \adasum{} on BERT-LARGE
  converges in 30\% fewer steps than the baseline.
\end{abstract}
