abstract:ee52f1523abcc860.tex

\begin{abstract}

% We propose ACProp (Asynchronous-Centering-Prop), an adaptive optimizer that combines centering of the second momentum (as in RMSprop and AdaBelief) with an asynchronous update (at the $t$-th step, the denominator uses information up to step $t-1$, while the numerator uses the gradient at the $t$-th step). Compared with previous methods, ACProp has both better theoretical properties and better empirical performance. We show that ACProp has weaker convergence conditions: for the counterexample by Reddi et al. (2019), asynchronous optimizers (e.g., AdaShift, ACProp) converge $\forall \beta_1, \beta_2 \in (0,1)$, while synchronous optimizers (Adam, AdaBelief, RMSprop) could diverge;

% for asynchronous optimizers, using an example of an online convex problem with sparse gradients, we show that centering of the second momentum leads to a larger area of convergence in the hyper-parameter space.

% We demonstrate that ACProp has a convergence rate of $O(\frac{1}{\sqrt{T}})$ for the stochastic non-convex case, which matches the optimal rate and outperforms the $O(\frac{\log T}{\sqrt{T}})$ rate of RMSprop and Adam. We validate ACProp in extensive empirical studies: ACProp outperforms both SGD and other adaptive optimizers for image classification with CNNs; in the training of various GAN models, ACProp outperforms well-tuned Adam, AdaShift, RAdam and AdaBelief, demonstrating both numerical stability and good generalization; ACProp also achieves good performance in reinforcement learning and in the training of transformers. To sum up, ACProp enjoys both good theoretical properties (a weak convergence condition and an optimal convergence rate) and strong empirical performance (the numerical stability and fast convergence of adaptive optimizers, and the good generalization performance of SGD).

% \end{abstract}
