\begin{abstract}
The dynamic behavior of the RMSprop and Adam algorithms is studied through a combination of careful numerical experiments and theoretical explanations.
Three types of qualitative features are observed in the training loss curve: fast initial convergence, oscillations, and large spikes.
The sign gradient descent (signGD) algorithm, which is the limit of
Adam when taking the learning rate to $0$ while keeping the momentum parameters fixed,
is used to explain the fast initial convergence.
For the late phase of Adam, three different types of qualitative patterns are observed depending on the choice of the hyperparameters:
oscillations, spikes, and divergence.
In particular, Adam converges faster and more smoothly when the values of the two momentum factors are close to each other.
\end{abstract}

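% Heuristic sketch of the signGD limit mentioned in the abstract.
% Assumptions (not taken from the source): standard Adam notation
% $x_k$, $g_k$, $m_k$, $v_k$, step size $\eta$, momentum factors
% $\beta_1$, $\beta_2$; bias correction and the stabilizing constant
% $\epsilon$ are omitted.
To make the signGD limit concrete, the following is a heuristic sketch rather than a rigorous derivation. It assumes the standard form of the Adam update with bias correction and the stabilizing constant $\epsilon$ omitted, where $g_k = \nabla f(x_k)$, $\eta$ is the learning rate, $\beta_1$ and $\beta_2$ are the two momentum factors, and all operations are taken elementwise:
\[
m_k = \beta_1 m_{k-1} + (1-\beta_1)\, g_k, \qquad
v_k = \beta_2 v_{k-1} + (1-\beta_2)\, g_k^2, \qquad
x_{k+1} = x_k - \eta\, \frac{m_k}{\sqrt{v_k}} .
\]
When $\eta \to 0$ with $\beta_1$ and $\beta_2$ fixed, the iterates move very little over the effective averaging windows of lengths roughly $1/(1-\beta_1)$ and $1/(1-\beta_2)$, so the gradient is nearly constant there and $m_k \approx g_k$, $v_k \approx g_k^2$. Wherever the components of $g_k$ are nonzero, this suggests
\[
x_{k+1} \approx x_k - \eta\, \frac{g_k}{|g_k|} = x_k - \eta\, \mathrm{sign}(g_k),
\]
which is the signGD update.
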