\begin{abstract}%
%We provide a comprehensive analysis of the Stochastic Heavy Ball (SHB) method (otherwise known as the momentum method), including a convergence analysis of the last iterate of SHB that establishes a faster rate of convergence than existing bounds on the last iterate of Stochastic Gradient Descent (SGD) in the convex setting.
% Our analysis shows that, unlike SGD, the SHB method requires no final iterate averaging, owing to the implicit iterate averaging of SHB. We detail new iteration-dependent step sizes (learning rates) and momentum parameters for SHB that result in this fast convergence. Moreover, in the overparametrized and deterministic settings, assuming only smoothness and convexity, we prove that the iterates of SHB converge \textit{almost surely} to a minimizer, and that the function values at the last iterate of (S)HB converge \textit{asymptotically} faster than $O(1/k)$, with an improved rate of $o(1/k)$. We prove the same (new) results for \textit{a weighted average} of the iterates of SGD, which rely on an entirely different analysis from that of the known $o(1/k)$ convergence of (deterministic) gradient descent.
%Our analysis is general: using an arbitrary sampling framework, it includes all forms of mini-batching and non-uniform sampling as special cases. Furthermore, our analysis does not rely on bounded gradient assumptions; it relies only on smoothness, an assumption that can be more readily verified. Finally, we present extensive numerical experiments showing that our theoretically motivated parameter settings give a statistically significant speedup in convergence across a diverse collection of datasets.
%\end{abstract}