\begin{abstract}
We consider non-convex stochastic optimization using first-order algorithms whose gradient estimates may have heavy tails. We show that a combination of gradient clipping, momentum, and normalized gradient descent yields convergence to critical points with high probability, at the best-known rates for smooth losses, when the gradients only have bounded $\mathfrak{p}$th moments for some $\mathfrak{p}\in(1,2]$. We then consider the case of second-order smooth losses, which, to our knowledge, have not been studied in this setting, and again obtain high-probability bounds for any such $\mathfrak{p}$. Moreover, our results hold for arbitrary smooth norms, in contrast to the typical SGD analysis, which requires a Hilbert space norm. Further, we show that after a suitable ``burn-in'' period, the objective value will \emph{monotonically decrease} whenever the current iterate is not a critical point, which provides intuition behind the popular practice of learning rate ``warm-up'' and also yields a last-iterate guarantee.
\end{abstract}
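
As a rough illustration only, and not necessarily the exact procedure analyzed in this paper, one way to combine clipping, momentum, and normalization in the Euclidean case is the following update, given a stochastic gradient $g_t$ at the iterate $x_t$. The clipping threshold $\tau$, momentum parameter $\beta\in[0,1)$, and step size $\eta$ are notation assumed for this sketch:
\[
\hat g_t = \min\!\left(1, \frac{\tau}{\|g_t\|}\right) g_t, \qquad
m_t = \beta\, m_{t-1} + (1-\beta)\,\hat g_t, \qquad
x_{t+1} = x_t - \eta\,\frac{m_t}{\|m_t\|}.
\]
In this sketch, clipping bounds the influence of any single heavy-tailed gradient estimate by $\tau$, momentum averages the clipped estimates, and normalization fixes the length of each step at $\eta$ whenever $m_t \neq 0$.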