\begin{abstract}%
We revisit the problem of learning a single neuron with ReLU activation under Gaussian inputs with the square loss.
We focus in particular on the over-parameterized setting, where the student network has $n\ge 2$ neurons.
We prove the global convergence of randomly initialized gradient descent at an $O\left(T^{-3}\right)$ rate.
This is the first global convergence result for this problem beyond the exact-parameterization setting ($n=1$), in which gradient descent enjoys an $\exp(-\Omega(T))$ rate.
Perhaps surprisingly, we further present an $\Omega\left(T^{-3}\right)$ lower bound for randomly initialized gradient flow in the over-parameterization setting.
These two bounds jointly give an exact characterization of the convergence rate and imply, for the first time, that \emph{over-parameterization can exponentially slow down the convergence rate}.
To prove the global convergence, we need to tackle the interactions among student neurons in the gradient descent dynamics, which are not present in the exact-parameterization case.
We use a three-phase structure to analyze the dynamics of gradient descent.
Along the way, we prove that gradient descent automatically balances the student neurons, and we use this property to deal with the non-smoothness of the objective function.
To prove the lower bound on the convergence rate, we construct a novel potential function that characterizes the pairwise distances between the student neurons (which cannot be done in the exact-parameterization case).
We show that this potential function converges slowly, which implies the slow convergence rate of the loss function.
\end{abstract}