
\begin{abstract}

    We study the optimization dynamics and the generalization properties of one-hidden-layer neural networks with quadratic activation function in the over-parametrized regime, where the layer width $m$ is larger than the input dimension $d$.

    We consider a teacher-student scenario in which the teacher has the same structure as the student but with a hidden layer of smaller width $m^*\le m$.

    We describe how the empirical loss landscape is affected by the number $n$ of data samples and the width $m^*$ of the teacher network. In particular, we determine how the probability that the empirical loss has no spurious minima depends on $n$, $d$, and $m^*$, thereby establishing conditions under which the neural network can in principle recover the teacher.

    We also show that, under the same conditions, gradient descent dynamics on the empirical loss converges and leads to small generalization error, i.e., it enables recovery in practice.

    Finally, we characterize the convergence rate in time of gradient descent in the limit of a large number of samples.

    These results are confirmed by numerical experiments.

\end{abstract}
