\begin{abstract}
Multi-layer neural networks are among the most powerful models in machine learning,
yet the fundamental reasons for this success defy mathematical understanding.
Learning a neural network requires optimizing a non-convex high-dimensional
objective (risk function), a problem that is usually attacked using stochastic gradient descent (SGD).
Does SGD converge to a global optimum of the risk or only to a local optimum? In the first case,
does this happen because local minima are absent, or because SGD somehow avoids them? In the
second case, why do local minima reached by SGD have good generalization properties?

In this paper we consider a simple case, namely two-layer neural networks,
and prove that, in a suitable scaling limit, SGD dynamics is captured by a certain non-linear
partial differential equation (PDE) that we call \emph{distributional
dynamics} (DD). We then consider several specific examples, and show
how DD can be used to prove convergence of SGD to networks with
nearly ideal generalization error. This description allows one to `average out' some of the complexities
of the landscape of neural networks, and can be used to prove a general convergence result for noisy SGD.
\end{abstract}