\begin{abstract}

We analyze deep ReLU neural networks trained with mini-batch stochastic gradient descent (SGD) and weight decay. We show, both theoretically and empirically, that training a neural network using SGD with weight decay and a small batch size tends to produce weight matrices of low rank. Our analysis relies on a minimal set of assumptions: the networks may be arbitrarily wide or deep and may include residual connections as well as convolutional layers.

The same analysis implies the inherent presence of SGD ``noise'', defined as the inability of SGD to converge to a stationary point. In particular, we prove that SGD noise must always be present, even asymptotically, whenever weight decay is used and the batch size is smaller than the total number of training samples.

%TP%Furthermore, we study the source of SGD noise and prove that when training with weight decay, the only solutions of SGD at convergence are zero functions.

\end{abstract}
