\begin{abstract}
Heavy-tailed noise in first-order nonconvex stochastic optimization
has recently attracted considerable attention, as many empirical
observations suggest that it is a more realistic condition than the
classical one. Specifically, the stochastic noise (the difference between
the stochastic and true gradients) is assumed only to have a finite
$\mathfrak{p}$-th moment for some $\mathfrak{p}\in\left(1,2\right]$,
rather than to satisfy the classical finite-variance
assumption. To handle this more challenging setting, prior work has
proposed various algorithms and proved that they converge at the optimal
$\mathcal{O}(T^{\frac{1-\mathfrak{p}}{3\mathfrak{p}-2}})$ rate for
smooth objectives after $T$ iterations. Notably, all of these newly designed
algorithms rely on the same technique: gradient clipping.
This naturally raises the question of whether clipping is a necessary
ingredient and the only way to guarantee convergence under heavy-tailed
noise. In this work, by revisiting the existing Batched Normalized
Stochastic Gradient Descent with Momentum (Batched NSGDM) algorithm,
we provide the first convergence result under heavy-tailed noise
\textit{without} gradient clipping. Concretely, we prove that
Batched NSGDM achieves the optimal $\mathcal{O}(T^{\frac{1-\mathfrak{p}}{3\mathfrak{p}-2}})$
rate even under the relaxed smoothness condition. More interestingly,
we also establish the first $\mathcal{O}(T^{\frac{1-\mathfrak{p}}{2\mathfrak{p}}})$
convergence rate for the case where the tail index $\mathfrak{p}$
is unknown in advance, which is arguably the common scenario in practice.
\end{abstract}