\begin{abstract}
It is not yet clear why \Adam-like adaptive gradient algorithms generalize worse than \Sgd~despite training faster. This work aims to explain this generalization gap by analyzing the local convergence behaviors of these algorithms. Specifically, we observe that the gradient noise in these algorithms is heavy-tailed. This motivates us to analyze them through their \levyp-driven stochastic differential equations (SDEs), since an algorithm and its SDE exhibit similar convergence behaviors. We then establish the escaping time of these SDEs from a local basin. The result shows that (1) the escaping time of both \Sgd~and \Adam~depends positively on the Radon measure of the basin and negatively on the heaviness of the gradient noise tails; (2) for the same basin, \Sgd~has a smaller escaping time than \Adam, mainly because (a) the geometry adaptation in \Adam, which adaptively scales each gradient coordinate, diminishes the anisotropic structure of the gradient noise and results in a larger Radon measure of the basin; and (b) the exponential gradient average in \Adam~smooths its gradient and leads to lighter gradient noise tails than in \Sgd. Hence \Sgd~is more locally unstable than \Adam~at sharp minima, defined as minima whose local basins have small Radon measure, and can better escape from them to flatter minima with larger Radon measure. Since flat minima, which here refer to minima in flat or asymmetric basins/valleys, often generalize better than sharp ones~\cite{keskar2016large,he2019asymmetric}, our result explains the better generalization performance of \Sgd~over \Adam. Finally, experimental results confirm our heavy-tailed gradient noise assumption and our theoretical results.
\end{abstract}