\begin{abstract}
We introduce \OURS, a second-order stochastic optimization algorithm that
dynamically incorporates the curvature of the loss function via \textsc{ADA}ptive estimates of the \textsc{Hessian}.
Second-order algorithms are among the most powerful optimization algorithms,
with superior convergence properties compared to first-order methods such as \sgd and \adam.
The main disadvantages of traditional second-order methods are their heavier per-iteration computation
and their poor accuracy compared to first-order methods.
To address these, we incorporate several novel approaches in \OURS, including:
(i) a fast Hutchinson-based method to approximate the curvature matrix with low computational overhead;
(ii) a root-mean-square exponential moving average to smooth out variations of the Hessian diagonal across iterations; and
(iii) block-diagonal averaging to reduce the variance of the Hessian diagonal elements.
We show that \OURS achieves new state-of-the-art results by a large margin compared
to other adaptive optimization methods, including variants of \adam.
In particular, we perform extensive tests on CV, NLP, and recommendation system tasks and find that \OURS:
(i) achieves 1.80\%/1.45\% higher accuracy on ResNet20/32 on CIFAR-10, and 5.55\% higher accuracy on ImageNet, compared to \adam;
(ii) outperforms \adamw for transformers by 0.13/0.33 BLEU on IWSLT14/WMT14 and by 2.7/1.0 PPL on PTB/Wikitext-103;
(iii) outperforms \adamw for SqueezeBERT by 0.41 points on GLUE; and
(iv) achieves a 0.032\% better score than \adagrad for DLRM on the Criteo Ad Kaggle dataset.
Importantly, we show that the cost per iteration of \OURS is comparable to that of first-order
methods, and that the method is robust to its hyperparameters.
The code for \OURS is open-sourced and publicly available~\cite{adahessian}.
\end{abstract}