\begin{abstract}
When large-batch training is used to speed up stochastic gradient descent, learning rates must adapt to the new batch sizes in order to maximize speed-ups and preserve model quality.  Re-tuning learning rates is resource-intensive, while fixed scaling rules often degrade model quality.  We propose \fullalgname{}, an algorithm that reliably adapts learning rates to large-batch training.  By continually adapting to the gradient's variance, \algname{} automatically achieves speed-ups for a wide range of batch sizes.  We formalize this property with a convergence bound for \algname{} that maintains final objective values even as batch sizes grow large and the number of iterations decreases.  In empirical comparisons, \algname{} trains well beyond the batch size limits of popular ``linear learning rate scaling'' rules.  This includes large-batch training with no model degradation for machine translation, image classification, object detection, and speech recognition tasks.  \algname{}'s qualitative behavior is similar to that of ``warm-up'' heuristics, but unlike warm-up, this behavior emerges naturally from a principled mechanism.  The algorithm introduces negligible computational overhead and no new hyperparameters, making \algname{} an attractive choice for large-scale training in practice.
\end{abstract}