\begin{abstract}
Variance reduction (VR) techniques have significantly accelerated learning on massive datasets in the smooth and strongly convex setting~\cite{schmidt2017minimizing, johnson2013accelerating, roux2012stochastic}. However, they have not yet met with the same success in large-scale deep learning, owing to factors such as data augmentation and regularization methods like dropout~\cite{defazio2019ineffectiveness}.
This challenge has recently motivated the design of novel variance reduction techniques tailored explicitly to deep learning~\cite{arnold2019reducing, ma2018quasi}. This work takes a further step in this direction. In particular, we exploit the ubiquitous clustering structure of the rich datasets used in deep learning to design a family of scalable variance-reduced optimization procedures, combining existing optimizers (e.g., SGD+Momentum, Quasi-Hyperbolic Momentum, Implicit Gradient Transport) with a multi-momentum strategy~\cite{yuan2019cover}. Our proposal converges faster than the vanilla methods on standard benchmark datasets (e.g., CIFAR and ImageNet), is robust to label noise, and is amenable to distributed optimization. We provide a parallel implementation in JAX.
\end{abstract}