\begin{abstract}
Training large deep neural networks on massive datasets is computationally challenging. There has been a recent surge of interest in using \emph{large batch} stochastic optimization methods to tackle this issue. The most prominent algorithm in this line of research is $\lars$, which, by employing \emph{layerwise adaptive} learning rates, trains $\resnet$ on ImageNet in a few minutes. However, $\lars$ performs poorly for attention models like $\bert$, indicating that its performance gains are \emph{not} consistent across tasks. In this paper, we first study a principled layerwise adaptation strategy to accelerate training of deep neural networks using large mini-batches. Using this strategy, we develop a new layerwise adaptive large batch optimization technique called $\lamb$; we then provide a convergence analysis of $\lamb$ as well as $\lars$, showing convergence to a stationary point in general nonconvex settings. Our empirical results demonstrate the superior performance of $\lamb$ across various tasks such as $\bert$ and $\resnet$-50 training with very little hyperparameter tuning. In particular, for $\bert$ training, our optimizer enables the use of very large batch sizes of 32768 without any degradation of performance. By increasing the batch size to the memory limit of a TPUv3 Pod, $\bert$ training time can be reduced from 3 days to just 76 minutes (Table \ref{table:results}). The $\lamb$ implementation is available online\footnote{\url{https://github.com/tensorflow/addons/blob/master/tensorflow_addons/optimizers/lamb.py}}.
\end{abstract}