\begin{abstract}
    We propose \method{}, a new variant of the Adam optimizer~\citep{kingma2014adam} that specifically minimizes memory overhead while maintaining theoretical convergence guarantees.
    We achieve this by compressing the gradient information before it is fed into the optimizer state,
    thereby reducing its memory footprint significantly.
    We control the resulting compression error via a novel instance of the classical \emph{error feedback} mechanism from distributed optimization~\citep{2014-seide, 2018-alistarh, 2019-karimireddy}, in which \emph{the error correction information is itself compressed} to allow for practical memory gains.
    We prove that the resulting approach maintains theoretical convergence guarantees competitive with those of AMSGrad, while providing good practical performance.
    Specifically, we show that \method{} can be implemented efficiently on GPUs: on both million-scale (BERT) and billion-scale (LLaMA) models, \method{} provides practical convergence competitive with that of the uncompressed Adam baseline, with lower memory usage and similar running time. Our code is available at \texttt{https://github.com/IST-DASLab/MicroAdam}.
\end{abstract}