\begin{abstract}
  % The abstract paragraph should be indented \nicefrac{1}{2}~inch (3~picas) on
  % both the left- and right-hand margins. Use 10~point type, with a vertical
  % spacing (leading) of 11~points.  The word \textbf{Abstract} must be centered,
  % bold, and in point size 12. Two line spaces precede the abstract. The abstract
  % must be limited to one paragraph.
Adaptive gradient-based optimizers, particularly Adam, are widely used for training large-scale deep learning models. Their strength is that they converge quickly while being more robust to hyperparameter choices. However, they often generalize worse than non-adaptive methods. Recent studies have tied this performance gap to flat minima selection: adaptive methods tend to find solutions in sharper basins of the loss landscape, which in turn hurts generalization. To overcome this issue, we propose a new memory-augmented version of Adam that promotes \emph{exploration} towards flatter minima by using a buffer of critical momentum terms during training. Intuitively, the buffer makes the optimizer overshoot the basin of attraction if it is not wide enough. We empirically show that our method improves the performance of several variants of Adam on standard supervised language modelling and image classification tasks.
 %language modelling image classification and
%benchmarks.
%including a sharpness aware Adam + SAM and a memory-augmented variant with critical gradients,

%Our approach is complementary to sharpness-aware minimization (SAM); we show that combining the two can further enhance generalization. % further when combined with SAM. % (CIFAR 10, CIFAR 100, Imagenet, Penn TreeBank). %  and show that it finds flatter solutions and improves the performance.  %performance of deep learning models
%on several supervised language and image tasks.
\end{abstract}