\begin{abstract}
We introduce \OURS, a second-order stochastic optimization algorithm that
dynamically incorporates the curvature of the loss function via \textsc{ADA}ptive estimates of the \textsc{Hessian}.
Second-order algorithms are among the most powerful optimization algorithms,
with superior convergence properties compared to first-order methods such as \sgd and \adam.
The main disadvantages of traditional second-order methods are their heavier per-iteration computation
and poorer accuracy compared to first-order methods.
To address these issues, we incorporate several novel approaches in \OURS, including:
(i) a fast Hutchinson-based method to approximate the curvature matrix with low computational overhead;
(ii) a root-mean-square exponential moving average to smooth out variations of the Hessian diagonal across different iterations; and
(iii) block-diagonal averaging to reduce the variance of the Hessian diagonal elements.
We show that \OURS achieves new state-of-the-art results by a large margin as compared
to other adaptive optimization methods, including variants of \adam.
In particular, we perform extensive tests on CV, NLP, and recommendation system tasks and find that \OURS:
(i) achieves 1.80\%/1.45\% higher accuracy on ResNet20/32 on CIFAR-10, and 5.55\% higher accuracy on ImageNet as compared to \adam;
(ii) outperforms \adamw for transformers by 0.13/0.33 BLEU score on IWSLT14/WMT14 and 2.7/1.0 PPL on PTB/Wikitext-103;
(iii) outperforms \adamw for SqueezeBERT by 0.41 points on GLUE;
and
(iv) achieves a 0.032\% better score than \adagrad for DLRM on the Criteo Ad Kaggle dataset.
Importantly, we show that the cost per iteration of \OURS is comparable to first-order
methods, and that it exhibits robustness towards its hyperparameters.
The code for \OURS is open source and publicly available~\cite{adahessian}.
\end{abstract}
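
As a rough illustration of the three ingredients named in the abstract, the formulas below sketch one way to write them down.
The symbols $H$ (Hessian), $g$ (gradient), $z$ (random probe vector), $D$, $D^{(s)}$, $\bar{D}_t$, the dimension $d$, the block size $b$, and the decay rate $\beta_2$ are notation assumed here for illustration; the abstract itself does not fix them.
The first formula is the standard Hutchinson identity for the diagonal of a matrix.
The second and third are one plausible reading of the block averaging and the root-mean-square moving average, and are not necessarily the exact form used in \OURS.

% Hutchinson estimate of the Hessian diagonal: only the product Hz is needed,
% which can be obtained by differentiating g^T z, so H is never formed explicitly.
\[
  D = \mathrm{diag}(H) = \mathrm{E}_{z}\left[ z \odot (H z) \right],
  \qquad z_i \in \{-1, +1\} \mbox{ i.i.d.\ with equal probability}.
\]

% Block averaging (assumed form): each entry is replaced by the mean of its block of size b.
\[
  D^{(s)}[ib + j] = \frac{1}{b} \sum_{k=1}^{b} D[ib + k],
  \qquad 1 \le j \le b, \quad 0 \le i \le d/b - 1.
\]

% Root-mean-square exponential moving average over iterations (assumed form);
% the product, division, and square root are all taken elementwise.
\[
  \bar{D}_t = \sqrt{ \frac{(1 - \beta_2) \sum_{\tau=1}^{t} \beta_2^{\,t-\tau} \, D^{(s)}_{\tau} \odot D^{(s)}_{\tau}}{1 - \beta_2^{\,t}} }.
\]

The identity in the first line holds because $\mathrm{E}[z_i z_j]$ equals $1$ when $i = j$ and $0$ otherwise, so the off-diagonal terms of $H$ cancel in expectation.
In practice the expectation would be replaced by a small number of random draws of $z$, which is why the smoothing in the second and third lines is useful for reducing the variance of the estimate.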