abstract:142ab1a2516b0dbe.tex

1: \begin{abstract}

2: Incorporating a so-called ``momentum'' dynamic in gradient descent methods is widely used in neural net training as it has been broadly observed that, at least empirically, it often leads to significantly faster convergence.

3: At the same time, there are very few theoretical guarantees

4: in the literature to explain this apparent acceleration effect.

5: Even for the classical strongly convex quadratic problems, several existing results show only that Polyak's momentum has an accelerated linear rate asymptotically.

6: In this paper, we first revisit the quadratic problems and show a non-asymptotic accelerated linear rate of Polyak's momentum.

7: Then, we prove that Polyak's momentum achieves acceleration for training a one-layer wide ReLU network and a deep linear network,

8: which are perhaps the two most popular canonical models for studying optimization and deep learning in the literature.

9: Prior work \citep{DZPS19,WDW19} showed that, using vanilla gradient descent together with over-parameterization, the error decays as $(1- \Theta(\frac{1}{ \kappa'}))^t$ after $t$ iterations, where $\kappa'$ is the condition number of a Gram matrix.

10: Our result shows that, with an appropriate choice of parameters, Polyak's momentum has a rate of $(1-\Theta(\frac{1}{\sqrt{\kappa'}}))^t$.

11: For the deep linear network, prior work \citep{HXP20} showed that

12: vanilla gradient descent has a rate of

13: $(1-\Theta(\frac{1}{\kappa}))^t$, where $\kappa$ is the condition number of a data matrix.

14: Our result shows that an accelerated rate of $(1- \Theta(\frac{1}{\sqrt{\kappa}}))^t$ is achievable by Polyak's momentum.

15: %All the results in this work are obtained from a modular analysis, which can be of independent interest.

16: This work establishes that momentum does indeed speed up neural net training.

17: \end{abstract}

18: