
\begin{abstract}


Gradient compression is a popular technique for improving the communication complexity of stochastic first-order methods in distributed training of machine learning models. However, existing works consider only with-replacement sampling of stochastic gradients. In contrast, it is well known in practice, and was recently confirmed in theory, that stochastic methods based on without-replacement sampling, e.g., the Random Reshuffling (\algname{RR}) method, perform better than methods that sample the gradients with replacement. In this work, we close this gap in the literature and provide the first analysis of methods with gradient compression and without-replacement sampling. We first develop a na\"ive combination of random reshuffling with gradient compression (\algname{Q-RR}). Perhaps surprisingly, the theoretical analysis of \algname{Q-RR} does not show any benefit of using \algname{RR}. Our extensive numerical experiments confirm this phenomenon, which is due to the additional variance introduced by compression. To reveal the true advantages of \algname{RR} in distributed learning with compression, we propose a new method called \algname{DIANA-RR} that reduces the compression variance and has provably better convergence rates than existing counterparts that use with-replacement sampling of stochastic gradients. Next, to better fit Federated Learning applications, we incorporate local computation: we propose and analyze variants of \algname{Q-RR} and \algname{DIANA-RR}, called \algname{Q-NASTYA} and \algname{DIANA-NASTYA}, that use local gradient steps and different local and global stepsizes. Finally, we conduct several numerical experiments to illustrate our theoretical results.


\end{abstract}
