2095e1c09e52eaf9.tex
1: \begin{abstract}
2: Training neural networks requires optimizing a loss function that may be highly irregular, and in particular neither convex nor smooth. Popular training algorithms are based on stochastic gradient descent with momentum (SGDM), for which classical analysis applies only if the loss is either convex or smooth. We show that a very small modification to SGDM closes this gap: simply scale the update at each time point by an exponentially distributed random scalar. The resulting algorithm achieves optimal convergence guarantees. Intriguingly, this result is not derived by a specific analysis of SGDM: instead, it falls naturally out of a more general framework for converting online convex optimization algorithms to non-convex optimization algorithms.
3: 
4: 
5: % We introduce the $(c,\epsilon)$-stationary point as a new convergence criterion for non-smooth non-convex stochastic optimization, which, with $c=\epsilon/\delta^2$, is a relaxation of the standard $(\delta,\epsilon)$-stationary point criterion. Additionally, we propose a general framework that converts online convex optimization algorithms to non-smooth optimization algorithms. Applied with an unconstrained variant of online gradient descent, this framework recovers the update mechanism of stochastic gradient descent with momentum. Applied with online gradient descent with an adaptive learning rate, it further reduces to an adaptive momentum algorithm. Both algorithms find a $(c,\epsilon)$-stationary point within $O(c^{1/2}\epsilon^{-7/2})$ iterations. Remarkably, they automatically achieve the optimal rates of $O(\epsilon^{-4})$ for smooth objectives and $O(\epsilon^{-7/2})$ for second-order smooth objectives.
6: \end{abstract}
7: