\begin{abstract}
%The optimization of neural networks mainly relies on the
%gradient-based methods.
%In practice, gradient descent with momentum term is commonly used for its fast convergence.
Momentum methods, including heavy-ball~(HB) and Nesterov's accelerated gradient~(NAG), are widely used in training neural networks for their fast convergence.
However, theoretical guarantees for their convergence and acceleration are lacking because the optimization landscape of neural networks is non-convex.
Recent works have made progress towards understanding the convergence of momentum methods in the over-parameterized regime, where the number of parameters exceeds the number of training instances.
Nonetheless, current results mainly focus on two-layer neural networks and are far from explaining the remarkable success of momentum methods in training deep neural networks.
Motivated by this, we investigate the convergence of NAG with a constant learning rate and momentum parameter in training two architectures of deep linear networks: deep fully-connected linear neural networks and deep linear ResNets.
In the over-parameterized regime, we first analyze the residual dynamics induced by the training trajectory of NAG for a deep fully-connected linear neural network under random Gaussian initialization.
Our results show that NAG converges to the global minimum at a $(1 - \O(1/\sqrt{\kappa}))^t$ rate, where $t$ is the iteration number and $\kappa > 1$ is a constant depending on the condition number of the feature matrix.
Compared with the $(1 - \O(1/{\kappa}))^t$ rate of gradient descent~(GD), NAG achieves an acceleration.
To the best of our knowledge, this is the first theoretical guarantee for the convergence of NAG to the global minimum in training deep neural networks.
Furthermore, we extend our analysis to deep linear ResNets and derive a similar convergence result.
\end{abstract}
