
\begin{abstract}

Adaptive gradient methods are workhorses in deep learning. However, their convergence guarantees for nonconvex optimization have not been sufficiently studied. In this paper, we provide a sharp analysis of a recently proposed adaptive gradient method, the partially adaptive momentum estimation method (Padam) \citep{chen2018closing}, which includes many existing adaptive gradient methods, such as RMSProp and AMSGrad, as special cases. Our analysis shows that, for smooth nonconvex functions, Padam converges to a first-order stationary point at the rate of $O\big((\sum_{i=1}^d\|\mathbf{g}_{1:T,i}\|_2)^{1/2}/T^{3/4} + d/T\big)$, where $T$ is the number of iterations, $d$ is the dimension, $\mathbf{g}_1,\ldots,\mathbf{g}_T$ are the stochastic gradients, and $\mathbf{g}_{1:T,i} = [g_{1,i},g_{2,i},\ldots,g_{T,i}]^\top$.

Our theoretical result also suggests that, to achieve a faster convergence rate, it is necessary to use Padam instead of AMSGrad. This is well aligned with the empirical results on deep learning reported in \citet{chen2018closing}.

%Our theoretical result also suggests that in order to achieve strictly faster convergence rate than SGD, it is necessary to use Padam instead of AMSGrad. This is well-aligned with the empirical results of deep learning reported in \citet{chen2018closing}.

%which \CC{in the worst case} reduces to $O\big(1/\sqrt{T}\big)$ and matches the convergence rate of vanilla stochastic gradient descent (SGD) for smooth nonconvex optimization.

\end{abstract}
