1: \begin{abstract}
2: Training large deep neural networks on massive datasets is computationally challenging. One promising approach to this problem is \emph{large batch} parallel stochastic optimization. However, our understanding of this approach in the context of deep learning is still very limited, and current methods in this direction are heavily hand-tuned. To this end, we first study a general adaptation strategy to accelerate training of deep neural networks using large mini-batches. Using this strategy, we develop a new layer-wise adaptive large batch optimization technique called $\lamb$. We also provide a formal convergence analysis of $\lamb$ as well as the previously published layer-wise optimizer $\lars$, showing convergence to a stationary point in general nonconvex settings. Our empirical results demonstrate the superior performance of $\lamb$ for BERT and ResNet-50 training. In particular, for BERT training, our optimizer enables the use of very large batch sizes of 32868, thereby requiring just 8599 iterations to train (as opposed to 1 million iterations in the original paper). By increasing the batch size to the memory limit of a TPUv3 Pod, BERT training time can be reduced from 3 days to 76 minutes (Table \ref{table:results}). Finally, we demonstrate that $\lamb$ outperforms previous large-batch training algorithms for ResNet-50 on ImageNet, obtaining state-of-the-art performance in just a few minutes.
3: %\textcolor{blue}{An implementation of LAMB can be found at https://github.com/xxx.}
4: %The training scripts of BERT and ResNet-50 are available upon request.
5: Additional results are provided in the appendix.
6: \end{abstract}
7: