\begin{abstract}
When using large-batch training to speed up stochastic gradient descent, learning rates must adapt to new batch sizes in order to maximize speed-ups and preserve model quality. Re-tuning learning rates is resource intensive, while fixed scaling rules often degrade model quality. We propose \fullalgname{}, an algorithm that reliably adapts learning rates to large-batch training. By continually adapting to the gradient's variance, \algname{} automatically achieves speed-ups for a wide range of batch sizes. We formally describe this quality with \algname{}'s convergence bound, which maintains final objective values, even as batch sizes grow large and the number of iterations decreases. In empirical comparisons, \algname{} trains well beyond the batch size limits of popular ``linear learning rate scaling'' rules. This includes large-batch training with no model degradation for machine translation, image classification, object detection, and speech recognition tasks. \algname{}'s qualitative behavior is similar to that of ``warm-up'' heuristics, but unlike warm-up, this behavior emerges naturally from a principled mechanism. The algorithm introduces negligible computational overhead and no new hyperparameters, making \algname{} an attractive choice for large-scale training in practice.
\end{abstract}
