\begin{abstract}
We present the results of numerical experiments on neural networks trained by
stochastic gradient-based optimization with adaptive momentum.
This widely used optimization method has proven convergence and is efficient in practice,
but it becomes numerically unstable in long-run training.
We show that numerical artifacts are observable not only in large-scale models:
they eventually lead to divergence in shallow, narrow networks as well.
We support this claim with experiments on more than $1600$ neural networks
trained for $50000$ epochs.
Local observations show that the network parameters behave in the same way
in both stable and unstable training segments.
Geometrically, the parameter trajectories form double twisted spirals in the parameter
space; this behavior is caused by numerical perturbations alternating with subsequent
relaxation oscillations in the values of the first- and second-moment estimates.
\end{abstract}