\begin{abstract}
In this work, we study the convergence \emph{in high probability}
of clipped gradient methods when the noise distribution has heavy
tails, i.e., with bounded $p$th moments for some $1<p\le2$. Prior
works in this setting follow the same recipe of using concentration
inequalities and an inductive argument with a union bound to bound the
iterates across all iterations. This method results in an increase
in the failure probability by a factor of $T$, where $T$ is the
number of iterations. We instead propose a new analysis approach based
on bounding the moment generating function of a well-chosen supermartingale
sequence. We improve the dependency on $T$ in the convergence guarantee
for a wide range of algorithms with clipped gradients, including stochastic
(accelerated) mirror descent for convex objectives and stochastic
gradient descent for nonconvex objectives. This approach naturally
allows the algorithms to use time-varying step sizes and clipping
parameters when the time horizon is unknown, which appears impossible
in prior works. We show that, in the case of clipped stochastic mirror
descent, problem constants, including the initial distance to the
optimum, are not required when setting step sizes and clipping parameters.
\end{abstract}