\begin{abstract}
%The optimization of neural networks mainly relies on the 
%gradient-based methods.
%In practice, gradient descent with momentum term is commonly used for its fast convergence.
Momentum methods, including heavy-ball~(HB) and Nesterov's accelerated gradient~(NAG), are widely used in training neural networks for their fast convergence.
However, theoretical guarantees for their convergence and acceleration are lacking, since the optimization landscape of neural networks is non-convex.
Recent works have made progress toward understanding the convergence of momentum methods in the over-parameterized regime, where the number of parameters exceeds the number of training instances.
Nonetheless, existing results mainly focus on two-layer neural networks and thus fall far short of explaining the remarkable success of momentum methods in training deep neural networks.
Motivated by this, we investigate the convergence of NAG with a constant learning rate and momentum parameter in training two architectures of deep linear networks: deep fully-connected linear neural networks and deep linear ResNets.
In the over-parameterized regime, we first analyze the residual dynamics induced by the training trajectory of NAG for a deep fully-connected linear neural network under random Gaussian initialization.
Our results show that NAG converges to the global minimum at a $(1 - \O(1/\sqrt{\kappa}))^t$ rate, where $t$ is the iteration number and $\kappa > 1$ is a constant depending on the condition number of the feature matrix.
Compared with the $(1 - \O(1/{\kappa}))^t$ rate of gradient descent~(GD), NAG achieves acceleration, improving the dependence on $\kappa$ from $1/\kappa$ to $1/\sqrt{\kappa}$.
To the best of our knowledge, this is the first theoretical guarantee for the convergence of NAG to the global minimum in training deep neural networks.
Furthermore, we extend our analysis to deep linear ResNets and derive a similar convergence result.
\end{abstract}