\begin{abstract}
Adaptive gradient methods are typically used for training over-parameterized models.
To better understand their behaviour, we study a simplistic setting -- smooth, convex losses with models over-parameterized enough to \emph{interpolate} the data.
In this setting, we prove that \amsgrad/ with constant step-size and momentum converges to the minimizer at a faster $\bigO(1/T)$ rate.
When interpolation is only approximately satisfied, constant step-size \amsgrad/ converges to a neighbourhood of the solution at the same rate, while \adagrad/ is robust to the violation of interpolation.
However, even for simple convex problems satisfying interpolation, the empirical performance of both methods heavily depends on the step-size and requires tuning, questioning their adaptivity.
We alleviate this problem by automatically determining the step-size using stochastic line-search or Polyak step-sizes.
With these techniques, we prove that both \adagrad/ and \amsgrad/ retain their convergence guarantees, without needing to know problem-dependent constants.
Empirically, we demonstrate that these techniques improve the convergence and generalization of adaptive gradient methods across tasks, from binary classification with kernel mappings to multi-class classification with deep networks.
\end{abstract}