9618dad1c1831750.tex
1: \begin{abstract}
2:     We study the dynamics of optimization and the generalization properties of one-hidden layer neural networks with quadratic activation function in the over-parametrized regime where the layer width $m$ is larger than the input dimension $d$. 
3:     We consider a teacher-student scenario where the teacher has the same structure as the student with a hidden layer of smaller width $m^*\le m$. 
4:     We describe how the empirical loss landscape is affected by the number $n$ of data samples and the width $m^*$ of the teacher network. In particular we determine how the probability that there be no spurious minima on the empirical loss depends on $n$, $d$, and $m^*$, thereby establishing conditions under which the neural network can in principle recover the teacher. 
5:     We  also show  that under the same conditions gradient descent dynamics on the empirical loss converges and leads to small generalization error, i.e. it enables recovery in practice.
6:     Finally we characterize the time-convergence rate of gradient descent  in the limit of a large number of samples.
7:     These results are confirmed by numerical experiments.
8: \end{abstract}
9: