\begin{abstract}
The dynamic behavior of the RMSprop and Adam algorithms is studied through a combination of careful numerical experiments and theoretical explanations.
Three types of qualitative features are observed in the training loss curve: fast initial convergence, oscillations, and large spikes.
The sign gradient descent (signGD) algorithm, which is the limit of Adam when the learning rate is taken to $0$ while the momentum parameters are kept fixed, is used to explain the fast initial convergence.
For the late phase of Adam, three different types of qualitative patterns are observed, depending on the choice of the hyper-parameters: oscillations, spikes, and divergence.
In particular, Adam converges faster and more smoothly when the values of the two momentum factors are close to each other.
\end{abstract}