
\begin{abstract}

It remains unclear why \Adam-like adaptive gradient algorithms generalize worse than \Sgd~despite training faster. This work aims to explain this generalization gap by analyzing their local convergence behaviors. Specifically, we observe heavy tails in the gradient noise of these algorithms. This motivates us to analyze these algorithms through their \levyp-driven stochastic differential equations (SDEs), since an algorithm and its SDE have similar convergence behaviors. We then establish the escaping time of these SDEs from a local basin. The result shows that (1) the escaping time of both \Sgd~and \Adam~depends positively on the Radon measure of the basin and negatively on the heaviness of the gradient noise; (2) for the same basin, \Sgd~enjoys a smaller escaping time than \Adam, mainly because (a) the geometry adaptation in \Adam, which adaptively scales each gradient coordinate, largely diminishes the anisotropic structure in the gradient noise and results in a larger Radon measure of the basin; (b) the exponential gradient average in \Adam~smooths its gradient and leads to lighter gradient noise tails than in \Sgd. So \Sgd~is more locally unstable than \Adam~at sharp minima, defined as the minima whose local basins have small Radon measure, and can better escape from them to flatter ones with larger Radon measure. As flat minima, which here refer to the minima at flat or asymmetric basins/valleys, often generalize better than sharp ones~\cite{keskar2016large,he2019asymmetric}, our result explains the better generalization performance of \Sgd~over \Adam. Finally, experimental results confirm our heavy-tailed gradient noise assumption and our theoretical results.

\end{abstract}
