
\begin{abstract}
Multi-layer neural networks are among the most powerful models in machine learning,
yet the fundamental reasons for this success defy mathematical understanding.
Learning a neural network requires optimizing a non-convex, high-dimensional
objective (risk function), a problem that is usually attacked using stochastic gradient descent (SGD).
Does SGD converge to a global optimum of the risk or only to a local optimum? In the first case,
does this happen because local minima are absent, or because SGD somehow avoids them? In the
second, why do local minima reached by SGD have good generalization properties?


In this paper we consider a simple case, namely two-layer neural networks,
and prove that, in a suitable scaling limit, the SGD dynamics is captured by a certain non-linear
partial differential equation (PDE) that we call \emph{distributional
  dynamics} (DD). We then consider several specific examples, and show
how DD can be used to prove convergence of SGD to networks with
nearly ideal generalization error. This description allows one to `average out' some of the complexities
of the landscape of neural networks, and can be used to prove a general convergence result for noisy SGD.
\end{abstract}
