\begin{abstract}

    We propose \method{}, a new variant of the Adam optimizer~\citep{kingma2014adam} designed to minimize memory overhead while maintaining theoretical convergence guarantees.
    We achieve this by compressing the gradient information before it is fed into the optimizer state, thereby significantly reducing the memory footprint of that state.
    We control the resulting compression error via a novel instance of the classical \emph{error feedback} mechanism from distributed optimization~\citep{2014-seide, 2018-alistarh, 2019-karimireddy}, in which \emph{the error correction information is itself compressed} to allow for practical memory gains.
    We prove that the resulting approach maintains theoretical convergence guarantees competitive with those of AMSGrad, while providing good practical performance.
    Specifically, we show that \method{} can be implemented efficiently on GPUs: on both million-scale (BERT) and billion-scale (LLaMA) models, \method{} provides practical convergence competitive with that of the uncompressed Adam baseline, with lower memory usage and similar running time. Our code is available at \texttt{https://github.com/IST-DASLab/MicroAdam}.

\end{abstract}
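
For orientation, the classical error feedback mechanism referred to in the abstract can be sketched as follows. The notation is illustrative and assumed here rather than taken from the paper: $g_t$ denotes the gradient at step $t$, $e_t$ the accumulated compression error (with $e_1 = 0$), and $\mathcal{C}$ a generic compressor, for example Top-$k$ sparsification.
\[
    a_t = g_t + e_t, \qquad
    \tilde{g}_t = \mathcal{C}(a_t), \qquad
    e_{t+1} = a_t - \tilde{g}_t .
\]
Only the compressed vector $\tilde{g}_t$ is passed on to the optimizer, while the residual $e_{t+1}$ is added back at the next step so that discarded gradient mass is delayed rather than lost. The abstract states that the error correction information is itself compressed; one way to picture this, again purely as an illustrative assumption, is to store $\mathcal{Q}(a_t - \tilde{g}_t)$ in place of $e_{t+1}$, where $\mathcal{Q}$ is a second, hypothetical compressor applied to the residual.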
