
\begin{abstract}

  Methods with adaptive stepsizes, such as \algname{AdaGrad} and \algname{Adam}, are essential for training modern Deep Learning models, especially Large Language Models. Typically, the noise in the stochastic gradients is heavy-tailed for the latter. Gradient clipping provably helps to achieve good high-probability convergence under such noise. However, despite the similarity between \algname{AdaGrad}/\algname{Adam} and \algname{Clip-SGD}, the high-probability convergence of \algname{AdaGrad}/\algname{Adam} has not been studied in this case. In this work, we prove that \algname{AdaGrad} (and its delayed version) can have provably bad high-probability convergence if the noise is heavy-tailed. To fix this issue, we propose a new version of \algname{AdaGrad} called \algname{Clip-RAdaGradD} (Clipped Reweighted \algname{AdaGrad} with Delay) and prove its high-probability convergence bounds with polylogarithmic dependence on the confidence level for smooth convex/non-convex stochastic optimization with heavy-tailed noise. Our empirical evaluations, including NLP model \revision{fine-tuning}, highlight the superiority of clipped versions of \algname{AdaGrad}/\algname{Adam} in handling heavy-tailed noise.

\end{abstract}
