\begin{abstract}

Lowering the memory requirement of full-parameter training for large models has become an active research area.
MeZO fine-tunes large language models (LLMs) using only forward passes with a zeroth-order SGD optimizer (ZO-SGD),
achieving excellent performance with the same GPU memory usage as inference.
However, the simulated perturbation stochastic approximation used for gradient estimation in MeZO leads to severe oscillations and incurs a substantial time overhead.
Moreover, without momentum regularization, MeZO suffers from severe overfitting.
Lastly, perturbation-irrelevant momentum in ZO-SGD does not improve the convergence rate.
This study proposes ZO-AdaMU to resolve these problems by adapting the simulated perturbation with momentum in its stochastic approximation.
Unlike existing adaptive momentum methods, we relocate the momentum onto the simulated perturbation in the stochastic gradient approximation.
Our convergence analysis and experiments show that this is a better way to improve convergence stability and rate in ZO-SGD.
Extensive experiments demonstrate that ZO-AdaMU yields better generalization for LLM fine-tuning across various NLP tasks than MeZO and its momentum variants.

\end{abstract}