\begin{abstract}
  This work establishes low test error of gradient flow (GF)
  and stochastic gradient descent (SGD)
  on two-layer ReLU networks
  with standard initialization,
  in three regimes where key sets of weights rotate little
  (either naturally due to GF and SGD, or due to an artificial constraint),
  using margins as the core analytic technique.
  The first regime is near initialization, specifically until the weights have
  moved by $\mathcal{O}(\sqrt m)$, where $m$ denotes the network width,
  which is in sharp contrast to the $\mathcal{O}(1)$ weight motion allowed by the
  Neural Tangent Kernel (NTK);
  here it is shown that GF and SGD need only a network width and number of samples
  inversely proportional to the NTK margin,
  and moreover that GF attains at least the NTK margin itself,
  which suffices to establish escape from bad KKT points of the margin objective,
  whereas prior work could only establish nondecreasing but arbitrarily small margins.
  The second regime is the Neural Collapse (NC) setting,
  where data lies in extremely well-separated groups,
  and the sample complexity scales with the number of groups;
  here the contribution over prior work is an analysis of the
  entire GF trajectory from initialization.
  Lastly, if the inner layer weights are constrained to change in norm only and
  cannot rotate,
  then GF with large widths achieves globally maximal margins, and its sample
  complexity scales with their inverse;
  this is in contrast to prior work,
  which required infinite width and a tricky dual convergence assumption.
  As purely technical contributions, this work
  develops a variety of potential functions and other tools
  which will hopefully aid future work.
\end{abstract}