\begin{abstract}
Fitting a function by using linear combinations of a large number $N$ of `simple' components is
one of the most fruitful ideas in statistical learning. This idea lies at the core of a variety of methods, from two-layer neural networks
to kernel regression, to boosting.
In general, the resulting risk minimization problem is non-convex and is solved by gradient descent or
its variants. Unfortunately, little is known about the global convergence properties of
these approaches.

Here we consider the problem of learning a concave function $f$ on a compact convex domain $\Omega\subset \reals^d$, using linear combinations of
`bump-like' components (neurons). The parameters to be fitted are the centers of $N$ bumps, and the resulting empirical risk minimization
problem is highly non-convex.
We prove that, in the limit in which the number of neurons diverges, the evolution of
gradient descent converges to a Wasserstein gradient flow in the space of probability distributions over $\Omega$.
Further, when the bump width $\delta$ tends to $0$, this gradient flow has a
limit which is a viscous porous medium equation.
Remarkably, the cost function optimized by this gradient flow exhibits a special property known as \emph{displacement convexity},
which implies exponential convergence rates for $N\to\infty$, $\delta\to 0$.

Surprisingly, this asymptotic theory appears to capture the behavior well for moderate values of $\delta, N$. Explaining this
phenomenon, and understanding the dependence on $\delta,N$ in a quantitative manner, remain an outstanding challenge.

\end{abstract}
