
\begin{abstract}

As the number of parameters in large language models grows, pre-training and fine-tuning demand increasingly large amounts of GPU memory.

A significant portion of this memory is typically consumed by the optimizer state.

To overcome this challenge, recent approaches such as low-rank adaptation (LoRA \citep{lora}), low-rank gradient projection (GaLore \citep{zhao2024galore}), and blockwise optimization (BAdam \citep{luo2024badam}) have been proposed.

However, in all these algorithms, the \textit{effective rank of the weight updates remains low}, which can lead to a substantial loss of information from the gradient.

This loss can be critical, especially during the pre-training stage.

In this paper, we introduce \ALG\ (\textbf{F}ull-\textbf{R}ank \textbf{U}pdates with \textbf{G}r\textbf{A}dient sp\textbf{L}itting), a new memory-efficient optimization framework.

\ALG\ leverages gradient splitting to perform low-dimensional updates using advanced algorithms (such as Adam), while updates along the remaining directions are executed via state-free methods such as SGD or signSGD \citep{signsgd-pmlr-v80-bernstein18a}.

Our framework can be integrated with various low-rank update selection techniques, including GaLore and BAdam.

We provide theoretical convergence guarantees for our framework when using SGD with momentum (SGDM) for low-dimensional updates and SGD for state-free updates.

Additionally, our method consistently outperforms concurrent approaches across various fixed memory budgets, achieving state-of-the-art results in pre-training and fine-tuning tasks while balancing memory efficiency and performance.


\end{abstract}
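
As a rough illustration of the gradient-splitting idea summarized in the abstract, a single step can be sketched as follows. The notation ($G_t$, $P_t$, $\mathcal{A}$, $\mathcal{S}$, $\eta$, $\eta'$) is assumed for this sketch only and is not taken from the paper; it is meant to show why the combined update need not be low-rank.
\[
G_t \;=\; \underbrace{P_t G_t}_{\mbox{\scriptsize low-dimensional part}} \;+\; \underbrace{(I - P_t)\, G_t}_{\mbox{\scriptsize remaining directions}},
\]
\[
W_{t+1} \;=\; W_t \;-\; \eta\, \mathcal{A}\left(P_t G_t\right) \;-\; \eta'\, \mathcal{S}\left((I - P_t)\, G_t\right).
\]
Here $G_t$ denotes the gradient at step $t$, and $P_t$ is a hypothetical projector onto the selected low-dimensional subspace (for example, a low-rank projection, or a mask selecting one block of parameters). $\mathcal{A}$ stands for a stateful rule such as Adam, whose state is stored only for the projected part. $\mathcal{S}$ stands for a state-free rule such as $\mathcal{S}(G) = G$ (SGD) or $\mathcal{S}(G) = \mathrm{sign}(G)$ (signSGD). Because the second term acts on the complement of the subspace, the total step is not confined to that subspace, while optimizer state is kept only for the first term.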
