
\begin{abstract}%

% Understanding AdaGrad is a first step towards understanding adaptive optimizers on non-convex landscapes. However, the convergence analysis of AdaGrad is not yet well established: existing works rely either on unrealistic assumptions or on complicated proofs that, as a result, yield a sub-optimal rate in the so-called over-parameterized regime. In this paper, we provide a refined analysis of AdaGrad. Specifically, we identify a novel auxiliary function $\xi$, based on which we derive a simple convergence analysis of AdaGrad assuming only affine noise variance and bounded smoothness. Our result shows that, in the over-parameterized regime, AdaGrad needs only $\mathcal{O}(\frac{1}{\varepsilon^2})$ iterations to ensure that the gradient norm is smaller than $\varepsilon$, which matches the rate of SGD and is the first result of this kind. We then step beyond the
% uniformly smooth landscape and consider a simple yet realistic non-uniformly smooth condition, called the $(L_0,L_1)$-smooth condition. Again based on the auxiliary function $\xi$, we prove that AdaGrad(-Norm) converges under this condition provided that the learning rate is smaller than a threshold. We show through a counterexample that this requirement is essential, and demonstrate that AdaGrad loses its tuning-free ability when the smoothness is no longer bounded. Together, our analyses broaden the understanding of AdaGrad and suggest that the auxiliary function $\xi$ is a powerful tool in the analysis of AdaGrad.

% \end{abstract}
