\begin{abstract}
    We present the results of numerical experiments on neural networks trained
    with stochastic gradient-based optimization with adaptive momentum.
    This widely used optimization method has proven convergence and practical efficiency,
    but it becomes numerically unstable in long-run training.
    We show that numerical artifacts are observable not only in large-scale models:
    they eventually lead to divergence even in shallow, narrow networks.
    We support this claim with experiments on more than $1600$ neural networks
    trained for $50000$ epochs.
    Local observations show that the network parameters behave in the same way
    in both stable and unstable training segments.
    Geometrically, the parameters trace double twisted spirals in the parameter
    space; this behavior is caused by numerical perturbations alternating with
    subsequent relaxation oscillations in the values of the $1^{st}$ and $2^{nd}$ momentum.
\end{abstract}
