abstract:5ade6b32bedb1caf.tex

1: \begin{abstract}

2: 	Deep learning networks are typically trained by Stochastic Gradient Descent (SGD) methods that iteratively improve the model parameters by estimating a gradient on a very small fraction of the training data. A major roadblock faced when increasing the batch size to a substantial fraction of the training data to reduce training time

3: 	is the persistent degradation in generalization performance (the generalization gap).

4: 	To address this issue, recent work proposes adding small perturbations to the model parameters when computing the stochastic gradients and reports improved generalization performance due to smoothing effects.

5: 	However, this approach is poorly understood; it often requires model-specific noise and fine-tuning.\\

6: 	To alleviate these drawbacks, we instead propose to use computationally efficient extrapolation (extragradient) to stabilize the optimization trajectory

7: 	while still benefiting from smoothing to avoid sharp minima.

8: 	This principled approach is well grounded from an optimization perspective, and we show that a host of variations can be covered in the unified framework that we propose.

9: 	We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer models.

10: 	We demonstrate that in a variety of experiments the scheme allows scaling to much larger batch sizes than before while reaching or surpassing state-of-the-art (SOTA) accuracy.

11:

12: \end{abstract}

13: