
\begin{abstract}
  For finite-sum optimization, variance-reduced (VR) gradient methods
  compute at each iteration the gradient of a single function (or of a
  mini-batch), and yet achieve faster convergence than SGD thanks to a
  carefully crafted lower-variance stochastic gradient estimator that
  reuses past gradients. Another important line of research of the
  past decade in continuous optimization is that of adaptive algorithms
  such as AdaGrad, which dynamically adjust the (possibly
  coordinate-wise) learning rate based on past gradients and thereby adapt to
  the geometry of the objective function. Variants such as RMSprop and
  Adam demonstrate outstanding practical performance that has
  contributed to the success of deep learning. In this work, we
  present AdaLVR, which combines the AdaGrad algorithm with
  \emph{loopless} variance-reduced gradient estimators such as SAGA or L-SVRG,
  which benefit from a straightforward construction and a streamlined analysis. We
  show that AdaLVR inherits both the good convergence properties of VR
  methods and the adaptive nature of AdaGrad: in the case of
  $L$-smooth convex functions, we establish a gradient complexity of
  $O(n+(L+\sqrt{nL})/\varepsilon)$ without prior knowledge of $L$. Numerical
  experiments demonstrate the superiority of AdaLVR over
  state-of-the-art methods. Moreover, we empirically show that the
  RMSprop and Adam algorithms combined with variance-reduced gradient
  estimators achieve even faster convergence.
\end{abstract}
