
\begin{abstract}

Motivated by the increasing popularity and importance of large-scale training under differential privacy (DP) constraints, we study distributed gradient methods with {\em gradient clipping}, i.e., clipping applied to the gradients computed from local information at the nodes. While gradient clipping is an essential tool for injecting formal DP guarantees into gradient-based methods~\citep{abadi2016deep}, it also induces bias, which causes serious convergence issues specific to the distributed setting. Inspired by recent progress in the error-feedback literature, which focuses on taming the bias/error introduced by communication compression operators such as Top-$k$~\citep{richtarik2021ef21}, and by mathematical similarities between the clipping operator and contractive compression operators, we design \algname{Clip21}, the first provably effective and practically useful error-feedback mechanism for distributed methods with gradient clipping. We prove that our method converges at the same $\cO(\nicefrac{1}{K})$ rate as distributed gradient descent in the smooth nonconvex regime, improving on the previous best $\cO(\nicefrac{1}{\sqrt{K}})$ rate, which was obtained under significantly stronger assumptions.

Our method converges significantly faster in practice than competing methods.

\end{abstract}
