\begin{abstract}

The \lion{} optimizer has emerged as a promising competitor to \adamw{} for training large AI models, with advantages in memory, computation, and sample efficiency. In this paper, we introduce \mavolion{}, an adaptation of \lion{} for distributed training environments. Leveraging the sign operator in \lion{}, \mavolion{} only needs to communicate binary or lower-precision vectors between the workers and the central server, significantly reducing the communication cost.
Our theoretical analysis confirms \mavolion{}'s convergence properties. Empirical results demonstrate its robustness across a range of tasks, worker counts, and batch sizes, on both vision and language problems. Notably, \mavolion{} attains performance comparable to standard \lion{} or \adamw{} optimizers applied to aggregated gradients, but with significantly reduced communication bandwidth. This feature is particularly advantageous for training large models. We also demonstrate that \mavolion{} offers a more favorable performance-bandwidth trade-off than existing efficient distributed methods such as deep gradient compression and ternary gradients.

\end{abstract}