
\begin{abstract}

	Error accumulation is effective for gradient sparsification in distributed settings: initially unselected gradient entries are eventually selected once their accumulated error exceeds a certain level. The accumulation essentially acts as a scaling of the learning rate for the selected entries. Although this property prevents the slow-down of lateral movements in distributed gradient descent, it can deteriorate convergence in some settings. This work proposes a novel sparsification scheme that controls the learning rate scaling of error accumulation. The development of this scheme follows two major steps. First, gradient sparsification is formulated as an inverse probability (inference) problem, and the Bayesian optimal sparsification mask is derived as a maximum-a-posteriori estimator. %The prior belief on aggregated gradients is then determined by making an analogy between our Bayesian sparsification framework and the classical \tpK algorithm.

	Second, using the prior distribution inherited from \tpK, we derive a new sparsification algorithm that can be interpreted as a regularized form of \tpK. We call this algorithm \textit{regularized} \tpK (\textsc{RegTop-}$k$). It uses past aggregated gradients to evaluate posterior statistics of the next aggregation, and then prioritizes the local accumulated gradient entries based on these posterior statistics. We validate our derivation through numerical experiments. In distributed linear regression, we observe that while \tpK remains at a fixed distance from the global optimum, \rgtpK converges to the global optimum at significantly higher compression ratios. We further show that this observation generalizes by employing \rgtpK in distributed training of ResNet-18 on CIFAR-10, where it noticeably outperforms \tpK.

\end{abstract}
