\begin{abstract}
As the number of parameters in large language models grows, pre-training and fine-tuning demand increasingly large amounts of GPU memory.
A significant portion of this memory is typically consumed by the optimizer state.
To overcome this challenge, recent approaches such as low-rank adaptation (LoRA \citep{lora}), low-rank gradient projection (GaLore \citep{zhao2024galore}), and blockwise optimization (BAdam \citep{luo2024badam}) have been proposed.
However, in all these algorithms, the \textit{effective rank of the weight updates remains low}, which can lead to a substantial loss of information from the gradient.
This loss can be critical, especially during the pre-training stage.
In this paper, we introduce \ALG\ (\textbf{F}ull-\textbf{R}ank \textbf{U}pdates with \textbf{G}r\textbf{A}dient sp\textbf{L}itting), a new memory-efficient optimization framework.
\ALG\ uses gradient splitting to perform low-dimensional updates with advanced
% optimization
algorithms (such as Adam), while updates along the remaining directions are performed by state-free methods such as SGD or signSGD \citep{signsgd-pmlr-v80-bernstein18a}.
Our framework can be integrated with various low-rank update selection techniques, including GaLore and BAdam.
We provide theoretical convergence guarantees for our framework when using SGDM for low-dimensional updates and SGD for state-free updates.
Additionally, our method consistently outperforms concurrent approaches across various fixed memory budgets, achieving state-of-the-art results in pre-training and fine-tuning tasks while balancing memory efficiency and performance.
\end{abstract}
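
As a rough illustration of the gradient splitting summarized in the abstract, the following sketch uses notation that is assumed here for exposition and is not taken from the paper. Let $G_t$ denote the gradient of a weight matrix $W_t \in \mathbb{R}^{m \times n}$, and let $P_t \in \mathbb{R}^{m \times r}$ be a hypothetical matrix with orthonormal columns spanning the selected low-dimensional subspace, as in a GaLore-style projection. The gradient then splits into a projected part and a residual part,
\[
G_t = P_t P_t^{\top} G_t + \left(I - P_t P_t^{\top}\right) G_t ,
\]
and one possible combined step, with assumed step sizes $\eta$ and $\eta'$ and signSGD as the state-free method, is
\[
W_{t+1} = W_t - \eta \, P_t \, \mathrm{Adam}\!\left(P_t^{\top} G_t\right) - \eta' \, \mathrm{sign}\!\left(\left(I - P_t P_t^{\top}\right) G_t\right),
\]
where $\mathrm{Adam}(\cdot)$ stands for the Adam update direction computed from optimizer state that is stored only for the $r \times n$ projected gradient. Because the residual term is generally nonzero, the overall update is not confined to the rank-$r$ subspace, while the optimizer state scales with $r$ rather than $m$.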