\begin{abstract}
In this work, we describe a generic approach to show convergence
with high probability for both stochastic convex and non-convex optimization
with sub-Gaussian noise. In previous works for convex optimization,
either the convergence is only in expectation or the bound depends
on the diameter of the domain. Instead, we show high probability convergence
with bounds depending on the initial distance to the optimal solution.
The algorithms use step sizes analogous to the standard settings and
are universal to Lipschitz functions, smooth functions, and their
linear combinations. The same method also applies to the non-convex
case. For SGD, we demonstrate an $O((1+\sigma^{2}\log(1/\delta))/T+\sigma/\sqrt{T})$
convergence rate when the number of iterations $T$ is known and an
$O((1+\sigma^{2}\log(T/\delta))/\sqrt{T})$ convergence rate when
$T$ is unknown, where $1-\delta$ is the desired success
probability. These bounds improve over existing bounds in the literature.
Additionally, we demonstrate that our techniques can be used to obtain
a high probability bound for AdaGrad-Norm \citep{ward2019adagrad} that
removes the bounded gradients assumption from previous works. Furthermore,
our technique for AdaGrad-Norm extends to the standard per-coordinate
AdaGrad algorithm \citep{duchi2011adaptive}, providing the first
noise-adapted high probability convergence guarantee for AdaGrad.
\end{abstract}
