
\begin{abstract}

Adaptive gradient methods are typically used for training over-parameterized models.

To better understand their behaviour, we study a simplified setting: smooth, convex losses with models over-parameterized enough to \emph{interpolate} the data.

In this setting, we prove that \amsgrad/ with constant step-size and momentum converges to the minimizer at a faster $\bigO(1/T)$ rate.

When interpolation is only approximately satisfied, constant step-size \amsgrad/ converges to a neighbourhood of the solution at the same rate, while \adagrad/ is robust to the violation of interpolation.

However, even for simple convex problems satisfying interpolation, the empirical performance of both methods depends heavily on the step-size, which requires tuning and calls their adaptivity into question.

We alleviate this problem by automatically determining the step-size using stochastic line-search or Polyak step-sizes.

With these techniques, we prove that both \adagrad/ and \amsgrad/ retain their convergence guarantees, without needing to know problem-dependent constants.

Empirically, we demonstrate that these techniques improve the convergence and generalization of adaptive gradient methods across tasks, from binary classification with kernel mappings to multi-class classification with deep networks.

\end{abstract}
