\begin{abstract}
Decentralized learning~(DL) has recently employed local updates to reduce the communication cost for general non-convex optimization problems.
Specifically, local updates require each node to perform multiple update steps on the parameters of its local model before communicating with others.
However, most existing methods can be highly sensitive to data heterogeneity (i.e., non-iid data distributions) and are adversely affected by stochastic gradient noise.
In this paper, we propose DSE-MVR to address these problems.
Specifically, DSE-MVR introduces a dual-slow estimation strategy that uses the gradient tracking technique to estimate the global accumulated update direction, which handles the data heterogeneity problem; to handle stochastic noise, it uses the mini-batch momentum-based variance-reduction technique.
We theoretically prove that DSE-MVR can achieve optimal convergence results for general non-convex optimization in both iid and non-iid data distribution settings.
In particular, the leading terms in the convergence rates derived for DSE-MVR are independent of the stochastic noise for large batches or large partial average intervals (i.e., the number of local update steps).
Further, we put forward DSE-SGD and theoretically justify the importance of the dual-slow estimation strategy in the data heterogeneity setting.
Finally, we conduct extensive experiments to show the superiority of DSE-MVR over other state-of-the-art approaches.
Our code is available at
https://anonymous.4open.science/r/DSE-MVR-32B8/.
\end{abstract}