f6e88f5ae4b1cb03.tex
1: \begin{abstract}
2: In deep learning, different kinds of deep networks typically need different optimizers, which have to be chosen after multiple trials, making the training process inefficient. 
3: To relieve this issue and consistently improve the model training speed across deep networks, we propose the ADAptive Nesterov momentum algorithm, Adan for short.
4: Adan first reformulates the vanilla Nesterov acceleration to develop a new Nesterov momentum estimation (NME) method, which avoids the extra overhead of computing the gradient at the extrapolation point.
5: Then Adan adopts NME to estimate the gradient's first- and second-order moments in adaptive gradient algorithms for convergence acceleration.
6: We also prove that Adan finds an $\epsilon$-approximate first-order stationary point within $\order{\epsilon^{-3.5}}$ stochastic gradient complexity on non-convex stochastic problems (\eg~deep learning problems), matching the best-known lower bound.
7: Extensive experimental results show that Adan consistently surpasses the corresponding SoTA optimizers on vision, language, and RL tasks and sets new SoTAs for many popular networks and frameworks, \eg~ResNet, ConvNext, ViT, Swin, MAE, DETR, GPT-2, Transformer-XL, and BERT.
8: More surprisingly, Adan can use half the training cost (epochs) of SoTA optimizers to achieve higher or comparable performance on ViT, GPT-2, MAE, \etc, and it tolerates a wide range of minibatch sizes, \eg~from 1k to 32k.
9: %We hope Adan can contribute to developing deep learning by reducing training costs and relieving the engineering burden of trying different optimizers on various architectures. 
10: Code is released at \url{https://github.com/sail-sg/Adan} and has been used in multiple popular deep learning frameworks and projects.
11: \blfootnote{$^{*}$Equal contribution. Xingyu did this work during an internship at Sea AI Lab.}  
12: \end{abstract}
13: