\begin{abstract}%
In decentralized optimization, $m$ agents form a network and communicate only with their neighbors, which offers advantages in data ownership, privacy, and scalability.
Decentralized stochastic gradient descent (\texttt{SGD}) methods, which are popular for training large-scale machine learning models, have also shown advantages over their centralized counterparts.
Distributed stochastic gradient tracking~(\dsgt)~\citep{pu2021distributed} is widely recognized as a popular, state-of-the-art decentralized \texttt{SGD} method owing to its theoretical guarantees.
However, the theoretical analysis of \dsgt~\citep{koloskova2021improved} shows that
its iteration complexity is $\tilde{\cO} \left(\frac{\bsig^2}{m\mu \varepsilon} + \frac{\sqrt{L}\bsig}{\mu(1 - \lambda_2(W))^{1/2} C_W \sqrt{\varepsilon} }\right)$, where $W$ is a doubly stochastic mixing matrix that represents the network topology and $C_W$ is a parameter that depends on $W$.
This indicates that the convergence of \dsgt~is heavily affected by the topology of the communication network.
To overcome this weakness of \dsgt, we adopt a snapshot gradient tracking technique and propose two novel algorithms.
We further show that the two proposed algorithms are more robust to the topology of the communication network, while keeping an algorithmic structure similar to that of \dsgt~and the same communication strategy.
Their iteration complexities are $\cO\left( \frac{\bsig^2}{m\mu\varepsilon} + \frac{\sqrt{L}\bsig}{\mu (1 - \lambda_2(W))\sqrt{\varepsilon}} \right)$ and $\cO\left( \frac{\bsig^2}{m\mu \varepsilon} + \frac{\sqrt{L}\bsig}{\mu (1 - \lambda_2(W))^{1/2}\sqrt{\varepsilon}} \right)$, respectively; unlike the bound for \dsgt, neither depends on $C_W$, which reduces the impact of the network topology.
\end{abstract}