
\begin{abstract}


Adaptive gradient methods such as Adam have proven very effective for training deep neural networks (DNNs); they track the second moment of the gradients to compute the individual learning rates. In contrast to existing methods, we use the most recent first moment of the gradients to compute the individual learning rates at each iteration. The motivation is that the dynamic variation of the first moment of the gradients may provide useful information for determining the learning rates. We refer to the new method as \emph{rapidly adapting moment estimation (RAME)}. % is By doing so, the new method is able to react to the dynamic variation of the first moment of gradients responsively, which is referred to as the \emph{responsively adaptive moment estimation (Rame)}. %The intensities of the individual learning rates of Rame are controlled by a scalar parameter, which includes the heavy-ball method as a special case by setting the scalar parameter to zero.

%One advantage of RAME is that it saves a memory space of the model size in comparison to Adam by avoiding second moment storage, which becomes significant for large-scale DNNs.

We study the convergence of deterministic RAME using an analysis similar to the one used in \cite{Adam18Converge} for Adam. Experimental results for training a number of DNNs show promising performance of RAME in terms of convergence speed and generalization compared with the stochastic heavy-ball (SHB) method, Adam, and RMSprop. % The empirical study confirms that the first moment of gradients indeed helps with achieving either better or equivalent  performance than SHB.


%We then consider applying the new algorithm for distributed averaging. For the case of no transmission failure, the new algorithm remarkably outperforms the state-of-the-art methods. For the case of transmission losses, the new algorithm is robust to transmission-failure.

\end{abstract}
