\begin{abstract}
Decentralized learning~(DL) has recently employed local updates to reduce the communication cost for general non-convex optimization problems.
Specifically, local updates require each node to perform multiple update steps on the parameters of the local model before communicating with others.
However, most existing methods are highly sensitive to data heterogeneity (i.e., non-iid data distributions) and are adversely affected by stochastic gradient noise.
In this paper, we propose DSE-MVR to address these problems.
Specifically, DSE-MVR introduces a dual-slow estimation strategy that uses gradient tracking to estimate the global accumulated update direction, thereby handling data heterogeneity; to reduce stochastic noise, it uses a mini-batch momentum-based variance-reduction technique.
We theoretically prove that DSE-MVR can achieve optimal convergence results for general non-convex optimization in both iid and non-iid data distribution settings.
In particular, the leading terms in the convergence rates derived for DSE-MVR are independent of the stochastic noise for large batches or large partial average intervals (i.e., the number of local update steps).
Further, we propose DSE-SGD and theoretically justify the importance of the dual-slow estimation strategy in the data heterogeneity setting.
Finally, we conduct extensive experiments to show the superiority of DSE-MVR over other state-of-the-art approaches.
Our code is available at https://anonymous.4open.science/r/DSE-MVR-32B8/.
\end{abstract}