\begin{abstract}
While the convergence behaviors of stochastic gradient methods are
well understood \emph{in expectation}, there still exist many gaps
in the understanding of their convergence with \emph{high probability},
where the convergence rate has a logarithmic dependency on the desired
success probability parameter. In the \emph{heavy-tailed
noise} setting, where the stochastic gradient noise only has bounded
$p$-th moments for some $p\in(1,2]$, existing works could only show
bounds \emph{in expectation} for a variant of stochastic gradient
descent (SGD) with clipped gradients, or high probability bounds in
special cases (such as $p=2$) or with extra assumptions (such as
the stochastic gradients having bounded non-central moments). In this
work, using a novel analysis framework, we present new and time-optimal
(up to logarithmic factors) \emph{high probability} convergence bounds
for SGD with clipping under heavy-tailed noise for both convex and
non-convex smooth objectives using only minimal assumptions.
\end{abstract}