\begin{abstract}
          Deep ReLU networks trained with the square loss have been observed to perform well in
          classification tasks. We provide here a theoretical
          justification based on an analysis of the associated gradient
          flow. We show that convergence to a
          solution with the absolute minimum norm is expected when
          normalization techniques such as Batch
          Normalization (BN) or Weight
          Normalization (WN) are used together with Weight Decay
          (WD). The main property of the minimizers
          that bounds their expected error is the norm: we prove that,
          among all the close-to-interpolating solutions, the ones associated
          with smaller Frobenius norms of the unnormalized weight matrices have
          better margin and better bounds on the expected
          classification error. With BN but in the absence of WD, the dynamical
          system is singular. Implicit dynamical regularization,
          that is, zero initial conditions biasing the dynamics towards
          high-margin solutions, is also possible in the no-BN and
          no-WD case. The theory yields several predictions,
          including the role of BN and WD, aspects of Papyan,
          Han and Donoho's Neural Collapse, and the constraints induced
          by BN on the network weights.
\end{abstract}