
\begin{abstract}

Decentralized SGD can run with low communication costs, but its sparse communication degrades the convergence rate, especially when the number of nodes is large.

In decentralized learning settings, communication is assumed to occur only on a given topology, whereas in many practical cases the topology merely represents a preferred communication pattern, and connecting to arbitrary nodes is still possible.

Previous studies have tried to alleviate the convergence rate degradation in these cases by designing topologies with large spectral gaps.

However, the degradation is still significant when the number of nodes is large.

In this work, we propose \proposed{}.

\proposed{} activates only a subset of nodes, and the active nodes fetch the parameters from the previously active nodes.

Then, the active nodes update their parameters by SGD and perform gossip averaging on a relatively small topology comprising only the active nodes.

We show that by activating only an appropriate number of nodes, \proposed{} can completely alleviate the convergence rate degradation.

Furthermore, we propose an efficient hyperparameter-tuning method to search for the appropriate number of nodes to be activated.

Experimentally, we show that \proposed{} can train neural networks more stably and achieve higher accuracy than Decentralized SGD.

\end{abstract}
