\begin{abstract}
Neural networks are usually trained by some form of stochastic gradient
descent (SGD). A number of strategies intended to improve SGD
optimization are in common use, such as learning rate schedules,
momentum, and batching. These are motivated by ideas about the
occurrence of local minima at different scales, valleys, and other
phenomena in the objective function. Empirical results presented here
suggest that these phenomena are not significant factors in SGD
optimization of MLP-related objective functions, and that the behavior
of SGD in these problems is better described as the simultaneous
convergence, at different rates, of many largely non-interacting
subproblems.
\end{abstract}