1: \begin{abstract}% <- trailing '%' for backward compatibility of .sty file
2:
3: Many popular learning-rate schedules for deep neural networks combine a decaying trend with local perturbations that attempt to escape saddle points and bad local minima. We derive convergence guarantees for bandwidth-based step-sizes, a general class of learning rates that are allowed to vary within a banded region. This framework includes many popular cyclic and non-monotonic step-sizes for which no theoretical guarantees were previously known. We provide worst-case guarantees for SGD on smooth non-convex problems under several bandwidth-based step-sizes, including stagewise $1/\sqrt{t}$ and the popular \emph{step-decay} (``constant and then drop by a constant''), which is also shown to be optimal. Moreover, we show that the momentum variant of SGD (SGDM) converges as fast as SGD under the bandwidth-based step-decay step-size. Finally, we propose novel step-size schemes in the bandwidth-based family and verify their efficiency on several deep neural network training tasks.
4:
5:
9:
10:
11:
12:
13:
14:
15: \end{abstract}
16: