abstract:f6e88f5ae4b1cb03.tex

1: \begin{abstract}

2: In deep learning, different kinds of deep networks typically need different optimizers, which have to be chosen after multiple trials, making the training process inefficient.

3: To relieve this issue and consistently improve the model training speed across deep networks, we propose the ADAptive Nesterov momentum algorithm, Adan for short.

4: Adan first reformulates the vanilla Nesterov acceleration to develop a new Nesterov momentum estimation (NME) method, which avoids the extra overhead of computing the gradient at the extrapolation point.

5: Adan then adopts NME to estimate the first- and second-order moments of the gradient in adaptive gradient algorithms, which accelerates convergence.

6: Moreover, we prove that Adan finds an $\epsilon$-approximate first-order stationary point within $\order{\epsilon^{-3.5}}$ stochastic gradient complexity on non-convex stochastic problems (\eg~deep learning problems), matching the best-known lower bound.

7: Extensive experimental results show that Adan consistently surpasses the corresponding SoTA optimizers on vision, language, and RL tasks and sets new SoTAs for many popular networks and frameworks, \eg~ResNet, ConvNeXt, ViT, Swin, MAE, DETR, GPT-2, Transformer-XL, and BERT.

8: More surprisingly, Adan needs only half of the training cost (epochs) of SoTA optimizers to achieve higher or comparable performance on ViT, GPT-2, MAE, \etc, and it tolerates a wide range of minibatch sizes, \eg~from 1k to 32k.

9: %We hope Adan can contribute to developing deep learning by reducing training costs and relieving the engineering burden of trying different optimizers on various architectures.

10: Code is released at \url{https://github.com/sail-sg/Adan} and has been used in multiple popular deep learning frameworks and projects.

11: \blfootnote{$^{*}$Equal contribution. Xingyu did this work during an internship at Sea AI Lab.}

12: \end{abstract}

13: