abstract:8416e870a57cb661.tex

1: \begin{abstract}

2: Decentralized learning~(DL) has recently employed local updates to reduce the communication cost for general non-convex optimization problems.

3: Specifically, local updates require each node to perform multiple update steps on the parameters of the local model before communicating with others.

4: However, most existing methods are highly sensitive to data heterogeneity (i.e., non-iid data distributions) and are adversely affected by stochastic gradient noise.

5: In this paper, we propose DSE-MVR to address these problems.

6: Specifically, DSE-MVR introduces a dual-slow estimation strategy that uses gradient tracking to estimate the global accumulated update direction, which handles data heterogeneity; to handle stochastic noise, it uses a mini-batch momentum-based variance-reduction technique.

7: We theoretically prove that DSE-MVR can achieve optimal convergence results for general non-convex optimization in both iid and non-iid data distribution settings.

8: In particular, the leading terms in the convergence rates derived for DSE-MVR are independent of the stochastic noise for large batches or large partial average intervals (i.e., the number of local update steps).

9: Further, we put forward DSE-SGD and theoretically justify the importance of the dual-slow estimation strategy under data heterogeneity.

10: Finally, we conduct extensive experiments to show the superiority of DSE-MVR over other state-of-the-art approaches.

11: We provide our code here:

12: https://anonymous.4open.science/r/DSE-MVR-32B8/.

13: \end{abstract}

14: