\begin{abstract}
Fitting a function by using linear combinations of a large number $N$ of `simple' components is
one of the most fruitful ideas in statistical learning. This idea lies at the core of a variety of methods, from two-layer neural networks
to kernel regression, to boosting.
In general, the resulting risk minimization problem is non-convex and is solved by gradient descent or
its variants. Unfortunately, little is known about global convergence properties of
these approaches.

Here we consider the problem of learning a concave function $f$ on a compact convex domain $\Omega\subset \reals^d$, using linear combinations of
`bump-like' components (neurons). The parameters to be fitted are the centers of $N$ bumps, and the resulting empirical risk minimization
problem is highly non-convex.
We prove that, in the limit in which the number of neurons diverges, the evolution of
gradient descent converges to a Wasserstein gradient flow in the space of probability distributions over $\Omega$.
Further, when the bump width $\delta$ tends to $0$, this gradient flow has a
limit which is a viscous porous medium equation.
Remarkably, the cost function optimized by this gradient flow exhibits a special property known as \emph{displacement convexity},
which implies exponential convergence rates for $N\to\infty$, $\delta\to 0$.

Surprisingly, this asymptotic theory appears to capture well the behavior for moderate values of $\delta, N$. Explaining this
phenomenon, and understanding the dependence on $\delta,N$ in a quantitative manner remains an outstanding challenge.
\end{abstract}