\begin{abstract}
Neural networks, a central tool in machine learning, have
demonstrated remarkable, high-fidelity performance on image
recognition and classification tasks. These successes evince an
ability to accurately represent high-dimensional functions, but
% potentially of great use in computational and applied mathematics.
% That said,
% Networks, however, require to be optimized or `trained' and
rigorous results about the approximation
error of neural networks after training are few. Here we establish
conditions for the global convergence of stochastic gradient descent
(SGD), the standard optimization algorithm used in machine learning
applications, and quantify how its error scales with the size of the
network. This is done by reinterpreting SGD as the
evolution of a particle system with interactions governed by a
potential related to the objective or ``loss'' function used to
train the network. We show that, when the number $n$ of units
is large, the empirical distribution of the particles descends on a
convex landscape towards the global minimum at a rate independent of
$n$, with a resulting approximation error that universally scales as
$O(n^{-1})$. These properties are established in the form of a Law of
Large Numbers and a Central Limit Theorem for
the empirical distribution.
% and, remarkably, these scaling results do not depend on the
% dimensionality of the domain of the function that we seek to
% represent.
Our analysis also quantifies the scale and nature of the
noise introduced by SGD and provides
guidelines for the step size and batch size to use when training a
neural network. We illustrate our findings on examples in which we
train neural networks to learn the energy function of the continuous
3-spin model on the sphere. The approximation error scales as our
analysis predicts in dimensions as high as $d=25$.
\end{abstract}
