\begin{abstract}%   <- trailing '%' for backward compatibility of .sty file
Ill-conditioned problems are ubiquitous in large-scale machine learning:
as a dataset grows to include more and more features correlated with the labels,
the condition number increases.
Yet traditional stochastic gradient methods
converge slowly on these ill-conditioned problems,
even with careful hyperparameter tuning.
This paper introduces PROMISE (\textbf{Pr}econditioned Stochastic \textbf{O}ptimization \textbf{M}ethods by \textbf{I}ncorporating \textbf{S}calable Curvature \textbf{E}stimates), a suite of sketching-based preconditioned stochastic gradient algorithms
that deliver fast convergence on ill-conditioned large-scale convex optimization problems arising in machine learning.
PROMISE includes preconditioned versions of SVRG, SAGA, and Katyusha;
each algorithm comes with a strong theoretical analysis and
effective default hyperparameter values.
% In contrast, traditional stochastic gradient methods
% require careful hyperparameter tuning to succeed,
% and degrade in the presence of ill-conditioning,
% a ubiquitous phenomenon in machine learning.
Empirically, we verify the superiority of the proposed algorithms
by showing that, using default hyperparameter values,
they outperform or match popular \emph{tuned} stochastic gradient optimizers
on a test bed of $51$ ridge and logistic regression problems
assembled from benchmark machine learning repositories.
On the theoretical side, this paper introduces the notion of \emph{quadratic regularity}
in order to establish linear convergence of all proposed methods
even when the preconditioner is updated infrequently.
The speed of linear convergence is determined by the \emph{quadratic regularity ratio},
which often provides a tighter bound on the convergence rate compared to the condition number,
both in theory and in practice,
and explains the fast global linear convergence of the proposed methods.
\end{abstract}