\begin{abstract}
  {A main puzzle of deep neural networks (DNNs) revolves
    around the apparent absence of ``overfitting'', defined in
    this paper as follows: the expected error does not get
    worse as the number of neurons or the number of
    iterations of gradient descent increases. This is surprising because
    of the large capacity demonstrated by DNNs to fit randomly
    labeled data and the absence of explicit
    regularization. Recent results by
    \cite{2017arXiv171010345S} provide a satisfying solution
    to the puzzle for linear networks used in binary
    classification. They prove that minimization of loss
    functions such as the logistic, the cross-entropy and the
    exp-loss yields asymptotic, ``slow'' convergence to the
    maximum margin solution for linearly separable datasets,
    independently of the initial conditions. Here we prove a
    similar result for nonlinear multilayer DNNs near zero
    minima of the empirical loss. The result holds for
    exponential-type losses but not for the square loss. In
    particular, we prove that the weight matrix at
    each layer of a deep network converges to a minimum norm
    solution up to a scale factor (in the separable case). Our analysis of the
    dynamical system corresponding to gradient descent of a
    multilayer network suggests a simple criterion for
    ranking the generalization performance of different
    zero minimizers of the empirical loss.}
\end{abstract}