\begin{abstract}
%-------------------------------------------------------------------------------
  Stochastic gradient descent (SGD) is an inherently sequential
  training algorithm: computing the gradient at batch $i$ depends on
  the model parameters learned from batch $i-1$.  Prior approaches
  that break this dependence do not honor it (e.g., they sum the
  gradients for each batch, which is not what sequential SGD would do)
  and thus potentially suffer from poor convergence. This paper
  introduces \adasum~(for adaptive sum), a novel method to
  \emph{combine} gradients that converges faster than prior work.
  \adasum{} is easy to implement, almost as efficient as simply summing
  gradients, and is integrated into the open-source toolkit Horovod.

  This paper first provides a formal justification for \adasum{} and
  then empirically demonstrates that \adasum{} is more accurate than prior
  gradient accumulation methods.  It then introduces a series of
  case studies to show that \adasum{} works with multiple frameworks
  (\tf{} and \torch) and scales multiple optimizers (Momentum-SGD,
  Adam, and LAMB) to larger batch sizes while still giving good
  downstream accuracy. Finally, it proves that \adasum{} converges.

  To summarize, \adasum{} scales Momentum-SGD on the MLPerf ResNet-50
  benchmark to 64K examples before communication (no MLPerf entry
  converged with more than 16K), the Adam optimizer to 64K examples
  before communication on BERT-LARGE (prior work showed Adam stopped
  scaling at 16K), and the LAMB optimizer to 128K examples before
  communication on BERT-LARGE (prior work used 64K), all while
  maintaining downstream accuracy metrics.  Even when a user does not
  need to scale, we show that LAMB with \adasum{} on BERT-LARGE
  converges in 30\% fewer steps than the baseline.
\end{abstract}