\begin{abstract}
Heavy-tailed noise in first-order nonconvex stochastic optimization
has recently attracted considerable attention, as many empirical
observations suggest it is a more realistic condition than the classical
one. Specifically, the stochastic noise (the difference between
the stochastic and true gradients) is assumed to have only a finite
$\mathfrak{p}$-th moment for some $\mathfrak{p}\in\left(1,2\right]$,
rather than satisfying the classical finite-variance
assumption. To handle this more challenging setting, several algorithms
have been proposed and proved to converge at the optimal
$\mathcal{O}(T^{\frac{1-\mathfrak{p}}{3\mathfrak{p}-2}})$ rate for
smooth objectives after $T$ iterations. Notably, all of these newly designed
algorithms rely on the same technique: gradient clipping.
Naturally, one may ask whether clipping is a necessary
ingredient and the only way to guarantee convergence under heavy-tailed
noise. In this work, by revisiting the existing Batched Normalized
Stochastic Gradient Descent with Momentum (Batched NSGDM) algorithm,
we provide the first convergence result under heavy-tailed noise
\textit{without} gradient clipping. Concretely, we prove that
Batched NSGDM achieves the optimal $\mathcal{O}(T^{\frac{1-\mathfrak{p}}{3\mathfrak{p}-2}})$
rate even under the relaxed smoothness condition. More interestingly,
we also establish the first $\mathcal{O}(T^{\frac{1-\mathfrak{p}}{2\mathfrak{p}}})$
convergence rate for the case where the tail index $\mathfrak{p}$
is unknown in advance, which is arguably the common scenario in practice.
\end{abstract}