
\begin{abstract}%

We revisit the problem of learning a single neuron with ReLU activation under Gaussian input with square loss.
We focus on the over-parameterization setting, where the student network has $n\ge 2$ neurons.
We prove the global convergence of randomly initialized gradient descent with an $O\left(T^{-3}\right)$ rate.
This is the first global convergence result for this problem beyond the exact-parameterization setting ($n=1$), in which gradient descent enjoys an $\exp(-\Omega(T))$ rate.
Perhaps surprisingly, we further present an $\Omega\left(T^{-3}\right)$ lower bound for randomly initialized gradient flow in the over-parameterization setting.
These two bounds jointly give an exact characterization of the convergence rate and imply, for the first time, that \emph{over-parameterization can exponentially slow down the convergence rate}.
To prove the global convergence, we need to tackle the interactions among student neurons in the gradient descent dynamics, which are not present in the exact-parameterization case.
We use a three-phase structure to analyze the dynamics of gradient descent. Along the way, we prove that gradient descent automatically balances the student neurons, and we use this property to deal with the non-smoothness of the objective function. To prove the convergence rate lower bound, we construct a novel potential function that characterizes the pairwise distances between the student neurons (which cannot be done in the exact-parameterization case).
We show that this potential function converges slowly, which implies the slow convergence rate of the loss function.


\end{abstract}
