
\begin{abstract}

%Empirical studies show that deep neural network trained by gradient-based algorithms with random initialization can easily fit real data with random labels. In order to explain this phenomenon,

A recent line of research has shown that gradient-based algorithms with random initialization can converge to the global minima of the training loss for over-parameterized (i.e., sufficiently wide) deep neural networks. However, the width required to guarantee global convergence is very stringent: it is often a high-degree polynomial in the training sample size $n$ (e.g., $O(n^{24})$). %In practice, %there still exists a huge gap between theory and practice since in practice

%a much smaller neural network already has the power to fit any data.

In this paper, we provide an improved analysis of the global convergence of (stochastic) gradient descent for training deep neural networks, which requires a milder over-parameterization condition than previous work in terms of the training sample size and other problem-dependent parameters. The main technical contributions of our analysis are (a) a tighter gradient lower bound that leads to faster convergence of the algorithm, and (b) a sharper characterization of the trajectory length of the algorithm. When specialized to two-layer (i.e., one-hidden-layer) neural networks, our result also yields a milder over-parameterization condition than the best-known result in prior work.

\end{abstract}
