\begin{abstract}
    We present the results of numerical experiments on neural networks trained
    with stochastic gradient-based optimization with adaptive momentum.
    This widely applied optimization method has proven convergence and practical
    efficiency, but it becomes numerically unstable in long-run training.
    We show that numerical artifacts are not limited to large-scale models
    and eventually lead to divergence even for shallow, narrow networks.
    We support this claim with experiments on more than $1600$ neural networks
    trained for $50000$ epochs.
    Local observations show that the network parameters exhibit the same behavior
    in both stable and unstable training segments.
    Geometrically, the parameters trace double twisted spirals in the parameter
    space; this is caused by numerical perturbations alternating with subsequent
    relaxation oscillations in the values of the first and second momentum.
\end{abstract}