abstract:5e2f7d4748fc930d.tex

1: \begin{abstract}

2: Training large neural networks typically requires sharing gradients between accelerators through specialized high-speed interconnects.

3: Drawing from the signal processing principles of frequency decomposition and energy compaction, we demonstrate that synchronizing full optimizer states and model parameters during training is unnecessary.

4: By decoupling momentum updates and allowing controlled divergence in optimizer states across accelerators, we achieve improved convergence compared to state-of-the-art optimizers.

5: We introduce {\textbf{De}}coupled {\textbf{Mo}}mentum (DeMo), a fused optimizer and data-parallel algorithm that reduces inter-accelerator communication requirements by several orders of magnitude.

6: This enables training of large neural networks even with limited network bandwidth and heterogeneous hardware.

7: Our method is topology-agnostic and architecture-independent, and it supports scalable clock-synchronous distributed training with negligible compute and memory overhead.

8: Empirical results show that models trained with DeMo match or exceed the performance of equivalent models trained with AdamW, while eliminating the need for high-speed interconnects when pre-training large-scale foundation models. An open-source reference PyTorch implementation is published on GitHub at \url{https://github.com/bloc97/DeMo}.

9: \end{abstract}

10: