\begin{abstract}
  {A main puzzle of deep neural networks (DNNs) revolves around the
  apparent absence of ``overfitting'', defined in this paper as follows:
  the expected error does not get worse as the number of neurons or the
  number of iterations of gradient descent increases. This is surprising
  because of the large capacity demonstrated by DNNs to fit randomly
  labeled data and the absence of explicit regularization. Recent results
  by \cite{2017arXiv171010345S} provide a satisfying solution to the
  puzzle for linear networks used in binary classification. They prove
  that minimization of loss functions such as the logistic, the
  cross-entropy and the exp-loss yields asymptotic, ``slow'' convergence
  to the maximum margin solution for linearly separable datasets,
  independently of the initial conditions. Here we prove a similar result
  for nonlinear multilayer DNNs near zero minima of the empirical loss.
  The result holds for exponential-type losses but not for the square
  loss. In particular, we prove that the weight matrix at each layer of a
  deep network converges to a minimum norm solution up to a scale factor
  (in the separable case). Our analysis of the dynamical system
  corresponding to gradient descent on a multilayer network suggests a
  simple criterion for ranking the generalization performance of
  different zero minimizers of the empirical loss.}
\end{abstract}