\begin{abstract}
In this work, we describe a generic approach to show convergence
with high probability for both stochastic convex and non-convex optimization
with sub-Gaussian noise. In previous works for convex optimization,
either the convergence is only in expectation or the bound depends
on the diameter of the domain. Instead, we show high probability convergence
with bounds depending on the initial distance to the optimal solution.
The algorithms use step sizes analogous to the standard settings and
are universal to Lipschitz functions, smooth functions, and their
linear combinations. This method can also be applied to the non-convex
case. We demonstrate an $O((1+\sigma^{2}\log(1/\delta))/T+\sigma/\sqrt{T})$
convergence rate when the number of iterations $T$ is known and an
$O((1+\sigma^{2}\log(T/\delta))/\sqrt{T})$ convergence rate when
$T$ is unknown for SGD, where $1-\delta$ is the desired success
probability. These bounds improve over existing bounds in the literature.
Additionally, we demonstrate that our techniques can be used to obtain
a high probability bound for AdaGrad-Norm \citep{ward2019adagrad} that
removes the bounded gradients assumption from previous works. Furthermore,
our technique for AdaGrad-Norm extends to the standard per-coordinate
AdaGrad algorithm \citep{duchi2011adaptive}, providing the first
noise-adapted high probability convergence guarantee for AdaGrad.
\end{abstract}