\begin{abstract}
  Deep ReLU networks trained with the square loss have been observed
  to perform well in classification tasks. We provide here a
  theoretical justification based on an analysis of the associated
  gradient flow. We show that convergence to a solution with the
  absolute minimum norm is expected when normalization techniques such
  as Batch Normalization (BN) or Weight Normalization (WN) are used
  together with Weight Decay (WD). The main property of the minimizers
  that bounds their expected error is the norm: we prove that, among
  all the close-to-interpolating solutions, the ones associated with
  smaller Frobenius norms of the unnormalized weight matrices have
  better margin and better bounds on the expected classification
  error. With BN but in the absence of WD, the dynamical system is
  singular. Implicit dynamical regularization (that is, zero initial
  conditions biasing the dynamics towards high-margin solutions) is
  also possible in the no-BN and no-WD case. The theory yields several
  predictions, including the role of BN and WD, aspects of Papyan, Han
  and Donoho's Neural Collapse, and the constraints induced by BN on
  the network weights.
\end{abstract}