\begin{abstract}
Deep learning networks are typically trained by Stochastic Gradient Descent (SGD) methods that iteratively improve the model parameters by estimating a gradient on a very small fraction of the training data. A major roadblock when increasing the batch size to a substantial fraction of the training data in order to reduce training time
is the persistent degradation in performance (generalization gap).
To address this issue, recent work proposes adding small perturbations to the model parameters when computing the stochastic gradients, and reports improved generalization performance due to smoothing effects.
However, this approach is poorly understood; it often requires model-specific noise and fine-tuning.\\
To alleviate these drawbacks, we propose instead to use computationally efficient extrapolation (extragradient) to stabilize the optimization trajectory
while still benefiting from smoothing to avoid sharp minima.
This principled approach is well grounded from an optimization perspective, and we show that a host of variations can be covered by the unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer models.
We demonstrate in a variety of experiments that the scheme allows scaling to much larger batch sizes than before while reaching or surpassing state-of-the-art accuracy.
\end{abstract}