\begin{abstract}
  Neural networks, a central tool in machine learning, have
  demonstrated remarkable, high-fidelity performance on image
  recognition and classification tasks.  These successes evince an
  ability to accurately represent high-dimensional functions, but
%   potentially of great use in computational and applied mathematics.
%   That said,
% Networks, however, require to be optimized or `trained' and
  rigorous results about the approximation
  error of neural networks after training are few. Here we establish conditions for global convergence of the standard optimization algorithm used in
  machine learning applications, stochastic gradient descent (SGD), and quantify the scaling of its error with the size of the network. This is done by reinterpreting SGD as the
  evolution of a particle system with interactions governed by a
  potential related to the objective or ``loss'' function used to
  train the network. We show that, when the number $n$ of units
  is large, the empirical distribution of the particles descends on a
  convex landscape towards the global minimum at a rate independent of $n$, with a resulting approximation error that universally scales as
  $O(n^{-1})$. These properties are established in the form of a Law of Large Numbers and a Central Limit Theorem for
  the empirical distribution.
%   and, remarkably, these scaling results do not depend on the
%   dimensionality of the domain of the function that we seek to
%   represent.
  Our analysis also quantifies the scale and nature of the
  noise introduced by SGD and provides
  guidelines for the step size and batch size to use when training a
  neural network. We illustrate our findings on examples in which we
  train neural networks to learn the energy function of the continuous
  3-spin model on the sphere.  The approximation error scales as our
  analysis predicts in dimensions as high as $d=25$.
\end{abstract}