\begin{abstract}
%\tpK sparsification invokes error accumulation to implicitly scale the learning rate and prevent the slow-down of distributed gradient descent. This property, however, can deteriorate convergence.
Error accumulation is an essential component of the \tpK sparsification method in distributed gradient descent. It implicitly scales the learning rate and prevents the slow-down of lateral movement, but it can also deteriorate convergence. This paper proposes a novel sparsification algorithm, called \textit{regularized} \tpK (\textsc{RegTop-}$k$), that controls the learning rate scaling induced by error accumulation. The algorithm is developed by casting gradient sparsification as an inference problem and determining a Bayesian optimal sparsification mask via maximum a posteriori estimation. It uses past aggregated gradients to evaluate posterior statistics, based on which it prioritizes the local gradient entries. Numerical experiments with ResNet-18 on CIFAR-10 show that at $0.1\%$ sparsification, \rgtpK achieves about $8\%$ higher accuracy than standard \tpK.
\end{abstract}