\begin{abstract}
  This work establishes low test error of gradient flow (GF)
  and stochastic gradient descent (SGD)
  on two-layer ReLU networks
  with standard initialization,
  in three regimes where key sets of weights rotate little
  (either naturally due to GF and SGD, or due to an artificial constraint),
  with margins as the core analytic technique.
  The first regime is near initialization, specifically until the weights have
  moved by $\mathcal{O}(\sqrt m)$, where $m$ denotes the network width,
  which is in sharp contrast to the $\mathcal{O}(1)$ weight motion allowed by the
  Neural Tangent Kernel (NTK);
  here it is shown that GF and SGD need only a network width and number of samples
  inversely proportional to the NTK margin,
  and moreover that GF attains at least the NTK margin itself,
  which suffices to establish escape from bad KKT points of the margin objective,
  whereas prior work could only establish nondecreasing but arbitrarily small margins.
  The second regime is the Neural Collapse (NC) setting,
  where data lies in extremely well-separated groups,
  and the sample complexity scales with the number of groups;
  here the contribution over prior work is an analysis of the
  entire GF trajectory from initialization.
  Lastly, if the inner layer weights are constrained to change in norm only and
  cannot rotate,
  then GF with large widths achieves globally maximal margins, and its sample
  complexity scales with their inverse;
  this is in contrast to prior work,
  which required infinite width and a tricky dual convergence assumption.
  As purely technical contributions, this work
  develops a variety of potential functions and other tools
  which will hopefully aid future work.
\end{abstract}