\begin{abstract}
We introduce two complementary techniques for efficient adaptive optimization
that reduce memory requirements while accelerating training of
large-scale neural networks. The first technique, \emph{Subset-Norm adaptive
step size}, generalizes AdaGrad-Norm and AdaGrad(-Coordinate) by reducing
the second moment term's memory footprint from $O(d)$ to $O(\sqrt{d})$
through step-size sharing, where $d$ is the model size. For non-convex
smooth objectives under coordinate-wise sub-gaussian gradient noise, we prove
a noise-adapted high-probability convergence guarantee showing improved
dimensional dependence over existing methods. Our second technique,
\emph{Subspace Momentum}, reduces the momentum state's memory footprint by
operating in a low-dimensional subspace while applying standard SGD in the
orthogonal complement. We establish high-probability convergence rates under
similar relaxed assumptions. Empirical evaluation on LLaMA models from 60M
to 1B parameters demonstrates the effectiveness of our methods: combining
Subset-Norm with Subspace Momentum reaches Adam's validation perplexity in
approximately half the training tokens (6.8B vs.\ 13.1B) while using only
20\% of Adam's optimizer-state memory footprint and requiring minimal
additional hyperparameter tuning.
\end{abstract}