\begin{abstract}
Adaptive optimization methods such as \textsc{AdaGrad}, \textsc{RMSprop} and \textsc{Adam} have been proposed to achieve a rapid training process with an element-wise scaling term on learning rates.
Although widely used, they are observed to generalize poorly compared with \textsc{Sgd}, or even to fail to converge, because of unstable and extreme learning rates.
Recent work has proposed algorithms such as \textsc{AMSGrad} to tackle this issue, but they fail to achieve considerable improvement over existing methods.
In this paper, we demonstrate that extreme learning rates can lead to poor performance.
We propose new variants of \textsc{Adam} and \textsc{AMSGrad}, called \textsc{AdaBound} and \textsc{AMSBound} respectively, which employ dynamic bounds on learning rates to achieve a gradual and smooth transition from adaptive methods to \textsc{Sgd}, and we give a theoretical proof of their convergence.
We further conduct experiments on a variety of popular tasks and models, an evaluation that is often insufficient in previous work.
Experimental results show that the new variants can eliminate the generalization gap between adaptive methods and \textsc{Sgd} while maintaining a higher learning speed early in training.
Moreover, they can bring significant improvement over their prototypes, especially on complex deep networks.
The implementation of the algorithm can be found at \url{https://github.com/Luolc/AdaBound}.
\end{abstract}