\begin{abstract}
Reducing the memory required for full-parameter training of large models has become an active research area.
MeZO fine-tunes large language models (LLMs) using only forward passes with a zeroth-order SGD optimizer (ZO-SGD),
achieving strong performance with the same GPU memory footprint as inference.
However,
the simulated perturbation stochastic approximation used for gradient estimation in MeZO causes severe oscillations and incurs substantial time overhead.
Moreover,
without momentum regularization,
MeZO suffers from severe overfitting.
Lastly,
applying perturbation-irrelevant momentum to ZO-SGD does not improve the convergence rate.
This study proposes ZO-AdaMU, which addresses these problems by adapting the simulated perturbation with momentum in its stochastic approximation.
Unlike existing adaptive momentum methods,
we relocate the momentum onto the simulated perturbation in the stochastic gradient approximation.
Our convergence analysis and experiments show that this is a more effective way to improve convergence stability and rate in ZO-SGD.
Extensive experiments demonstrate that ZO-AdaMU yields better generalization in LLM fine-tuning across various NLP tasks than MeZO and its momentum variants.
\end{abstract}