\begin{abstract}
Neural networks are usually trained by some form of stochastic gradient
descent (SGD). Several strategies intended to improve SGD optimization
are in common use, such as learning rate schedules, momentum, and
batching. These are motivated by ideas about the occurrence of local
minima at different scales, valleys, and other phenomena in the
objective function. Empirical results presented here suggest that these
phenomena are not significant factors in SGD optimization of MLP-related
objective functions, and that the behavior of stochastic gradient
descent in these problems is better described as the simultaneous
convergence, at different rates, of many largely non-interacting
subproblems.
\end{abstract}