\begin{abstract}
We analyze deep ReLU neural networks trained with mini-batch Stochastic Gradient Descent (SGD) and weight decay. We show, both theoretically and empirically, that when a neural network is trained using SGD with weight decay and a small batch size, the resulting weight matrices are expected to be of small rank. Our analysis relies on a minimal set of assumptions, and the networks may be arbitrarily wide or deep and may include residual connections, as well as batch normalization and convolutional layers. Furthermore, we suggest that the asymptotic presence of SGD ``noise'' is also due to the bias towards small rank. In particular, we prove that when training with weight decay, SGD yields low-rank minimizers that cannot interpolate all the training data.
%TP%Furthermore, we study the source of SGD noise and prove that when training with weight decay, the only solutions of SGD at convergence are zero functions.
\end{abstract}