
\begin{abstract}The adaptive moment estimation algorithm Adam (Kingma and Ba) is a popular optimizer for training deep neural networks. However, Reddi et al. recently showed that the convergence proof of Adam is flawed and proposed a variant of Adam, called AMSGrad, as a fix. In this paper, we show that the convergence proof of AMSGrad is also flawed. Concretely, the proof mishandles the hyper-parameters, treating them as equal when they are not. The same issue was overlooked in the convergence proof of Adam. We give an explicit counter-example in a simple convex optimization setting that exposes this issue. We then present several fixes, which differ in how the hyper-parameters are handled. The first fix is a new convergence proof for AMSGrad. The second is a new version of AMSGrad called AdamX. Our experiments on a benchmark dataset support our theoretical results.


\medskip

\noindent {\sc \keywordsname.} Optimizer, adaptive moment estimation, Adam, AMSGrad, deep neural networks.

\end{abstract}
