\begin{abstract}
We study the application of variance reduction (VR) techniques to
general non-convex stochastic optimization problems. In this setting,
the recent work STORM \cite{cutkosky2019momentum} overcomes the
drawback of having to compute gradients of ``mega-batches'' that
earlier VR methods rely on. STORM uses recursive momentum
to achieve the VR effect and was later made fully adaptive in
STORM+ \cite{levy2021storm+}, where full adaptivity removes the
need to know certain problem-specific parameters, such
as the smoothness of the objective and bounds on the variance and
norm of the stochastic gradients, in order to set the step size. However,
STORM+ crucially relies on the assumption that the function values
are bounded, excluding a large class of useful functions. In this
work, we propose $\algnamenew$, a generalized framework of STORM+
that removes this bounded function value assumption while still attaining
the optimal convergence rate for non-convex optimization. $\algnamenew$
not only maintains full adaptivity, removing the need to obtain
problem-specific parameters, but also improves the convergence rate's
dependence on the problem parameters. Furthermore, $\algnamenew$ can utilize
a large range of parameter settings that subsumes previous methods,
allowing for more flexibility in a wider range of settings. Finally,
we demonstrate the effectiveness of META-STORM through experiments
across common deep learning tasks. Our algorithm improves upon the
previous work STORM+ and is competitive with widely used algorithms
after the addition of per-coordinate update and exponential moving
average heuristics.
\end{abstract}