\begin{abstract}
Multi-layer neural networks are among the most powerful models in machine learning,
yet the fundamental reasons for this success defy mathematical understanding.
Learning a neural network requires optimizing a non-convex high-dimensional
objective (risk function), a problem that is usually attacked using stochastic gradient descent (SGD).
Does SGD converge to a global optimum of the risk or only to a local optimum? In the first case,
does this happen because local minima are absent, or because SGD somehow avoids them? In the
second, why do local minima reached by SGD have good generalization properties?

In this paper we consider a simple case, namely two-layer neural networks,
and prove that, in a suitable scaling limit, SGD dynamics is captured by a certain non-linear
partial differential equation (PDE) that we call \emph{distributional
  dynamics} (DD). We then consider several specific examples, and show
how DD can be used to prove convergence of SGD to networks with
nearly ideal generalization error. This description allows one to `average out' some of the complexities
of the landscape of neural networks, and can be used to prove a general convergence result for noisy SGD.
\end{abstract}