
\begin{abstract}%

In decentralized optimization, $m$ agents form a network and communicate only with their neighbors, which offers advantages in data ownership, privacy, and scalability.

At the same time, decentralized stochastic gradient descent (\texttt{SGD}) methods, which are popular for training large-scale machine learning models, have shown advantages over their centralized counterparts.

Distributed stochastic gradient tracking~(\dsgt)~\citep{pu2021distributed} is widely regarded as a popular, state-of-the-art decentralized \texttt{SGD} method owing to its theoretical guarantees.

However, the theoretical analysis of \dsgt~\citep{koloskova2021improved} shows that its iteration complexity is $\tilde{\cO} \left(\frac{\bsig^2}{m\mu \varepsilon} + \frac{\sqrt{L}\bsig}{\mu(1 - \lambda_2(W))^{1/2} C_W \sqrt{\varepsilon} }\right)$, where $W$ is a doubly stochastic mixing matrix that represents the network topology and $C_W$ is a parameter that depends on $W$.

This indicates that the convergence of \dsgt~is heavily affected by the topology of the communication network.

To overcome this weakness of \dsgt, we employ a snapshot gradient tracking technique and propose two novel algorithms.

We further show that the two proposed algorithms are more robust to the topology of the communication network, while retaining an algorithmic structure similar to, and the same communication strategy as, \dsgt.

The iteration complexities of the two proposed algorithms are $\cO\left( \frac{\bsig^2}{m\mu\varepsilon} + \frac{\sqrt{L}\bsig}{\mu (1 - \lambda_2(W))\sqrt{\varepsilon}} \right)$ and $\cO\left( \frac{\bsig^2}{m\mu \varepsilon} + \frac{\sqrt{L}\bsig}{\mu (1 - \lambda_2(W))^{1/2}\sqrt{\varepsilon}} \right)$, respectively; compared with \dsgt, both reduce the dependence on the network topology (no $C_W$).

\end{abstract}
