\begin{abstract}
	Deep learning networks are typically trained by Stochastic Gradient Descent (SGD) methods that iteratively improve the model parameters by estimating a gradient on a very small fraction of the training data. A major roadblock when increasing the batch size to a substantial fraction of the training data in order to reduce training time
	is the persistent degradation in performance (generalization gap).
	To address this issue, recent work proposes to add small perturbations to the model parameters when computing the stochastic gradients and reports improved generalization performance due to smoothing effects.
	However, this approach is poorly understood; it often requires model-specific noise and fine-tuning.\\
	To alleviate these drawbacks, we propose instead to use computationally efficient extrapolation (extragradient) to stabilize the optimization trajectory
	while still benefiting from smoothing to avoid sharp minima.
	This principled approach is well grounded from an optimization perspective, and we show that a host of variations can be covered by the unified framework that we propose.
	We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer models.
	We demonstrate in a variety of experiments that the scheme allows scaling to much larger batch sizes than before while reaching or surpassing SOTA accuracy.
\end{abstract}