\begin{abstract}

There has been a recent surge of interest in understanding the convergence of gradient descent (GD) and stochastic gradient descent (SGD) in overparameterized neural networks.

Most previous works assume that the training data is provided \emph{a priori} in a batch, while less attention has been paid to the important setting where the training data arrives in a stream.

In this paper, we study the streaming data setup and show that, with overparameterization and random initialization, the prediction error of two-layer neural networks under one-pass SGD converges in expectation. The convergence rate depends on the eigendecomposition of the integral operator associated with the so-called neural tangent kernel (NTK). A key step of our analysis is to show that a random kernel function converges to the NTK with high probability, using the VC dimension and McDiarmid's inequality.

\end{abstract}
