abstract:b5bf0dc45d26720e.tex

1: \begin{abstract}

2: Adaptive optimization methods such as \textsc{AdaGrad}, \textsc{RMSprop} and \textsc{Adam} have been proposed to achieve a rapid training process with an element-wise scaling term on learning rates.

3: Though prevailing, they are observed to generalize poorly compared with \textsc{Sgd} or even fail to converge due to unstable and extreme learning rates.

4: Recent work has put forward algorithms such as \textsc{AMSGrad} to tackle this issue, but they fail to achieve considerable improvement over existing methods.

5: In this paper, we demonstrate that extreme learning rates can lead to poor performance.

6: We provide new variants of \textsc{Adam} and \textsc{AMSGrad}, called \textsc{AdaBound} and \textsc{AMSBound} respectively, which employ dynamic bounds on learning rates to achieve a gradual and smooth transition from adaptive methods to \textsc{Sgd}, and we give a theoretical proof of their convergence.

7: We further conduct experiments on various popular tasks and models, an evaluation that is often insufficient in previous work.

8: Experimental results show that the new variants can eliminate the generalization gap between adaptive methods and \textsc{Sgd} while maintaining a higher learning speed early in training.

9: Moreover, they can bring significant improvement over their prototypes, especially on complex deep networks.

10: The implementation of the algorithm can be found at \url{https://github.com/Luolc/AdaBound}.

11:

12: \end{abstract}

13: