
\begin{abstract}

As deep learning models grow exponentially in size, optimizers such as Adam face significant memory consumption challenges because they must store first and second moment estimates.

Current memory-efficient methods like Adafactor and CAME often compromise accuracy with their matrix factorization techniques.

To address this, we introduce Adapprox, a novel approach that employs randomized low-rank matrix approximation to approximate Adam's second moment more effectively and accurately.

Adapprox features an adaptive rank selection mechanism, finely balancing accuracy and memory efficiency, and includes an optional cosine similarity guidance strategy to enhance stability and expedite convergence.

% Our empirical evaluations, focusing on training GPT-2 models and subsequent downstream tasks, show that Adapprox not only achieves memory savings of 33.8\% to 49.9\% over Adam while retaining the first moment (and up to 99.9\% when omitted) but also enhances convergence speed and overall performance on pretraining and downstream tasks compared to counterparts.

% In GPT-2 training and subsequent downstream tasks, Adapprox achieves 34.5\%-49.9\% memory savings for the 117M model and 33.8\%-49.9\% for the 345M model with the first moment enabled compared to AdamW (without it, savings rise to 84.5\%-99.9\% and 83.8\%-99.9\%), respectively, also enhancing convergence speed and task performance compared to counterparts.

In GPT-2 training and downstream tasks, Adapprox reduces memory usage relative to AdamW by 34.5\% to 49.9\% for the 117M model and by 33.8\% to 49.9\% for the 345M model with the first moment enabled, and it saves even more when the first moment is omitted. It also accelerates convergence and improves downstream task performance relative to its counterparts.

\end{abstract}
