\begin{abstract}

{A main puzzle of deep neural networks (DNNs) revolves
around the apparent absence of ``overfitting'', defined in
this paper as follows: the expected error does not get
worse when the number of neurons or the number of
gradient descent iterations increases. This is surprising
given the large capacity demonstrated by DNNs to fit randomly
labeled data and the absence of explicit
regularization. Recent results by
\cite{2017arXiv171010345S} provide a satisfying solution
to the puzzle for linear networks used in binary
classification. They prove that minimization of loss
functions such as the logistic loss, the cross-entropy, and the
exp-loss yields asymptotic, ``slow'' convergence to the
maximum margin solution for linearly separable datasets,
independently of the initial conditions. Here we prove a
similar result for nonlinear multilayer DNNs near zero
minima of the empirical loss. The result holds for
exponential-type losses but not for the square loss. In
particular, we prove that, in the separable case, the weight
matrix at each layer of a deep network converges to a minimum
norm solution up to a scale factor. Our analysis of the
dynamical system corresponding to gradient descent on a
multilayer network suggests a simple criterion for
ranking the generalization performance of different
zero minimizers of the empirical loss.}

\end{abstract}
