abstract:3327ce39080c5b6e.tex

1: \begin{abstract}

2: Fine-tuning large language models (LLMs) poses significant memory challenges, because back-propagation demands extensive resources that grow with model size. A recent method, \mezo{}, addresses this issue with zeroth-order (ZO) optimization, which reduces memory consumption to the level of inference. However, \mezo{} converges slowly because curvature varies across model parameters. To overcome this limitation, we introduce \helen{}, a novel scalable and memory-efficient optimizer that integrates annealed A-GNB gradients with diagonal Hessian estimation and layer-wise clipping, serving as a second-order pre-conditioner.

3: This combination allows for faster and more stable convergence. Our theoretical analysis shows that \helen{} improves convergence rates, particularly for models with heterogeneous layer dimensions: the rate depends on the largest layer dimension rather than on the total parameter-space dimension, making the method well suited to modern LLM architectures. Experiments on RoBERTa-large and OPT-1.3B across multiple tasks show that \helen{} achieves up to a $20\times$ speedup over \mezo{}, with an average accuracy improvement of 1.5\%. Furthermore, \helen{} is compatible with both full-parameter tuning and parameter-efficient fine-tuning (PEFT), and it outperforms several state-of-the-art optimizers.

4: % todo{say how they overcome the curvature}

5: The code will be released after the review process.

6: \end{abstract}

7: