\begin{abstract}
%-------------------------------------------------------------------------------
Stochastic gradient descent (SGD) is an inherently sequential
training algorithm: computing the gradient at batch $i$ depends on
the model parameters learned from batch $i-1$. Prior approaches
that break this dependence do not honor it (e.g., they sum the
gradients across batches, which is not what sequential SGD would do)
and can therefore suffer from poor convergence. This paper
introduces a novel method to \emph{combine} gradients called
\adasum~(for adaptive sum) that converges faster than prior work.
\adasum is easy to implement, almost as efficient as simply summing
gradients, and is integrated into the open-source toolkit Horovod.

This paper first provides a formal justification for \adasum and
then empirically demonstrates that \adasum is more accurate than prior
gradient accumulation methods. It then presents a series of
case studies showing that \adasum works with multiple frameworks
(\tf and \torch) and scales multiple optimizers (Momentum-SGD,
Adam, and LAMB) to larger batch sizes while still giving good
downstream accuracy. Finally, it proves that \adasum{} converges.

To summarize, \adasum scales Momentum-SGD on the MLPerf ResNet-50
benchmark to 64K examples before communication (no MLPerf entry
converged with more than 16K), the Adam optimizer to 64K examples
before communication on BERT-LARGE (prior work showed that Adam stops
scaling at 16K), and the LAMB optimizer to 128K examples before
communication on BERT-LARGE (prior work used 64K), all while maintaining
downstream accuracy metrics. Finally, when a user does not need to
scale, we show that LAMB with \adasum on BERT-LARGE converges in 30\%
fewer steps than the baseline.
\end{abstract}