\begin{abstract}

Deep ReLU networks trained with the square loss have been observed to perform well in
classification tasks. We provide here a theoretical
justification based on an analysis of the associated gradient
flow. We show that convergence to a
solution with the absolute minimum norm is expected when
normalization techniques such as Batch
Normalization (BN) or Weight
Normalization (WN) are used together with Weight Decay
(WD). The main property of the minimizers
that bounds their expected error is the norm: we prove that
among all the close-to-interpolating solutions, the ones associated
with smaller Frobenius norms of the unnormalized weight matrices have
better margin and better bounds on the expected
classification error. With BN but in the absence of WD, the dynamical
system is singular. Implicit dynamical regularization --
that is, zero initial conditions biasing the dynamics towards
high-margin solutions -- is also possible in the no-BN and
no-WD case. The theory yields several predictions,
including the role of BN and weight decay, aspects of Papyan,
Han and Donoho's Neural Collapse, and the constraints induced
by BN on the network weights.

\end{abstract}
