\begin{abstract}
The learning rate warmup heuristic achieves remarkable success in stabilizing training, accelerating convergence, and improving generalization for adaptive stochastic optimization algorithms such as RMSprop and Adam.
Pursuing the theory behind warmup,
we identify a problem with the adaptive learning rate: its variance is problematically large in the early stage of training. We hypothesize that warmup works as a variance reduction technique.
We provide both empirical and theoretical evidence to support this hypothesis.
We further propose Rectified Adam (RAdam), a novel variant of Adam that introduces a term to rectify the variance of the adaptive learning rate.
Experimental results on image classification, language modeling, and neural machine translation verify our intuition and demonstrate the efficacy and robustness of RAdam.\footnote{All implementations are available at: \url{https://github.com/LiyuanLucasLiu/RAdam}.}

% Here, we study its mechanism in details. 
% This paper presents the theoretical justification of the heuristic, showing that its effectiveness is due to the fact that it reduces the problematically large variance of the adaptive learning rate in the early stage.

%Further analysis derives a reliable and concise variance approximation, and we propose Rectified Adam,
% \XD{What RAdam refers to? rectified Adam?}
%a new variant of the Adam algorithm which rectifies the variance of the adaptive learning rate.
%We conduct experiments on various datasets and observe that RAdam leads to consistent improvements over the vanilla Adam, which demonstrates that the variance issue generally exists and affects training stability and model performance~\footnote{All implementations are available at: \url{https://github.com/LiyuanLucasLiu/RAdam}}.
% \XD{Font of x-y axis is too small and hard to see. Pls make it bigger.}
% and fixing such issues leads to better robustness and efficiency. 
\end{abstract}