\begin{abstract}
We introduce two complementary techniques for efficient adaptive optimization
that reduce memory requirements while accelerating training of
large-scale neural networks. The first technique, \emph{Subset-Norm adaptive
step size}, generalizes AdaGrad-Norm and AdaGrad(-Coordinate) by reducing
the second moment term's memory footprint from $O(d)$ to $O(\sqrt{d})$
through step-size sharing, where $d$ is the model size. For non-convex
smooth objectives under coordinate-wise sub-gaussian gradient noise, we prove
a noise-adapted high-probability convergence guarantee showing improved
dimensional dependence over existing methods. Our second technique,
\emph{Subspace Momentum}, reduces the momentum state's memory footprint by
operating in a low-dimensional subspace while applying standard SGD in the
orthogonal complement. We establish high-probability convergence rates under
similar relaxed assumptions. Empirical evaluation on LLaMA models from 60M
to 1B parameters demonstrates the effectiveness of our methods: combining
Subset-Norm with Subspace Momentum reaches Adam's validation perplexity in
approximately half the training tokens (6.8B vs.\ 13.1B) while using only
20\% of Adam's optimizer-state memory footprint and requiring minimal
additional hyperparameter tuning.
\end{abstract}
