\begin{abstract}
  For finite-sum optimization, variance-reduced (VR) gradient methods
  compute at each iteration the gradient of a single function (or of a
  mini-batch), and yet achieve faster convergence than SGD thanks to a
  carefully crafted lower-variance stochastic gradient estimator that
  reuses past gradients. Another important line of research of the
  past decade in continuous optimization is adaptive algorithms
  such as AdaGrad, which dynamically adjust the (possibly
  coordinate-wise) learning rate based on past gradients and thereby
  adapt to the geometry of the objective function. Variants such as
  RMSprop and Adam demonstrate outstanding practical performance that
  has contributed to the success of deep learning. In this work, we
  present AdaLVR, which combines the AdaGrad algorithm with
  \emph{loopless} variance-reduced gradient estimators such as SAGA or
  L-SVRG, and which benefits from a straightforward construction and a
  streamlined analysis. We show that AdaLVR inherits both the good
  convergence properties of VR methods and the adaptive nature of
  AdaGrad: in the case of $L$-smooth convex functions, we establish a
  gradient complexity of $O(n+(L+\sqrt{nL})/\varepsilon)$ without
  prior knowledge of $L$. Numerical experiments demonstrate the
  superiority of AdaLVR over state-of-the-art methods. Moreover, we
  empirically show that the RMSprop and Adam algorithms combined with
  variance-reduced gradient estimators achieve even faster convergence.
\end{abstract}