1: \begin{abstract}
2: Incorporating a so-called ``momentum'' dynamic in gradient descent methods is widely used in neural net training as it has been broadly observed that, at least empirically, it often leads to significantly faster convergence.
3: At the same time, there are very few theoretical guarantees
4: in the literature to explain this apparent acceleration effect.
5: Even for the classical strongly convex quadratic problems, several existing results show only that Polyak's momentum attains an accelerated linear rate asymptotically.
6: In this paper, we first revisit the quadratic problems and show a non-asymptotic accelerated linear rate of Polyak's momentum. 
7: Then, we prove that Polyak's momentum achieves acceleration for training a one-layer wide ReLU network and a deep linear network,
8: which are perhaps the two most popular canonical models for studying optimization and deep learning in the literature. 
9: Prior work \citep{DZPS19,WDW19} showed that, with over-parameterization, the error of vanilla gradient descent decays as $(1- \Theta(\frac{1}{ \kappa'}))^t$ after $t$ iterations, where $\kappa'$ is the condition number of a Gram matrix.
10: Our result shows that, with an appropriate choice of parameters, Polyak's momentum has a rate of $(1-\Theta(\frac{1}{\sqrt{\kappa'}}))^t$.
11: For the deep linear network, prior work \citep{HXP20} showed that 
12: vanilla gradient descent has a rate of 
13: $(1-\Theta(\frac{1}{\kappa}))^t$, where $\kappa$ is the condition number of a data matrix.  
14: Our result shows that an accelerated rate of $(1- \Theta(\frac{1}{\sqrt{\kappa}}))^t$ is achievable by Polyak's momentum.
15: %All the results in this work are obtained from a modular analysis, which can be of independent interest.
16: This work establishes that momentum does indeed speed up neural net training.
17: \end{abstract}
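
% Illustrative note; the notation below is assumed here and is not taken from the abstract.
For reference, a minimal sketch of the update discussed above, in notation that is assumed here rather than taken from the abstract (iterate $w_t$, loss $\ell$, step size $\eta > 0$, momentum parameter $\beta \in [0,1)$). Polyak's momentum, also known as the heavy ball method, iterates
\[
  w_{t+1} = w_t - \eta \nabla \ell(w_t) + \beta \, (w_t - w_{t-1}),
\]
and setting $\beta = 0$ recovers vanilla gradient descent. To read the stated rates as iteration counts, recall that $(1-c)^t \le e^{-ct}$ for $c \in [0,1]$. Ignoring the constants hidden in $\Theta(\cdot)$, a rate of $(1-\frac{1}{\kappa})^t$ is therefore at most $\epsilon$ once
\[
  t \ge \kappa \log(1/\epsilon),
  \qquad \text{whereas for } \Big(1-\frac{1}{\sqrt{\kappa}}\Big)^t \text{ it suffices that } \quad
  t \ge \sqrt{\kappa} \, \log(1/\epsilon).
\]
As a purely hypothetical example, $\kappa = 10^4$ and $\epsilon = 10^{-6}$ give roughly $1.4 \times 10^{5}$ versus $1.4 \times 10^{3}$ iterations from these sufficient conditions.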
18: