1: \begin{abstract}
2: This work establishes low test error of gradient flow (GF)
3: and stochastic gradient descent (SGD)
4: on two-layer ReLU networks
5: with standard initialization,
6: in three regimes where key sets of weights rotate little
7: (either naturally due to GF and SGD, or due to an artificial constraint),
8: using margins as the core analytic technique.
9: The first regime is near initialization, specifically until the weights have
10: moved by $\mathcal{O}(\sqrt m)$, where $m$ denotes the network width,
11: which is in sharp contrast to the $\mathcal{O}(1)$ weight motion allowed by the
12: Neural Tangent Kernel (NTK);
13: here it is shown that GF and SGD only need a network width and number of samples
14: inversely proportional to the NTK margin,
15: and moreover that GF attains at least the NTK margin itself,
16: which suffices to establish escape from bad KKT points of the margin objective,
17: whereas prior work could only establish nondecreasing but arbitrarily small margins.
18: The second regime is the Neural Collapse (NC) setting,
19: where data lies in extremely-well-separated groups,
20: and the sample complexity scales with the number of groups;
21: here the contribution over prior work is an analysis of the
22: entire GF trajectory from initialization.
23: Lastly, if the inner layer weights are constrained to change in norm only and
24: cannot rotate,
25: then GF with large widths achieves globally maximal margins, and its sample
26: complexity scales with their inverse;
27: this is in contrast to prior work,
28: which required infinite width and a tricky dual convergence assumption.
29: As purely technical contributions, this work
30: develops a variety of potential functions and other tools
31: which will hopefully aid future work.
32: \end{abstract}
33: