\begin{abstract}
Decentralized SGD keeps communication costs low, but its sparse communication degrades the convergence rate, especially when the number of nodes is large.
In decentralized learning settings, communication is assumed to occur only on a given topology; in many practical cases, however, the topology merely represents a preferred communication pattern, and connecting to arbitrary nodes is still possible.
Previous studies have tried to alleviate the convergence rate degradation in these cases by designing topologies with large spectral gaps.
However, the degradation remains significant when the number of nodes is large.
In this work, we propose \proposed{}.
\proposed{} activates only a subset of nodes, and the active nodes fetch the parameters from the previously active nodes.
Then, the active nodes update their parameters by SGD and perform gossip averaging on a relatively small topology comprising only the active nodes.
We show that by activating an appropriate number of nodes, \proposed{} can completely eliminate the convergence rate degradation.
Furthermore, we propose an efficient hyperparameter-tuning method to search for the appropriate number of nodes to be activated.
Experimentally, we show that \proposed{} trains neural networks more stably and achieves higher accuracy than Decentralized SGD.
\end{abstract}

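% Illustrative sketch only: the symbols V_t, \pi_t, W^{(t)}, \eta, F_i, and \xi_i^{(t)} are assumed notation, not taken from the source.
The three steps described in the abstract (fetch, local SGD update, gossip averaging among active nodes) can be sketched as one round of updates.
This is only a possible formalization under assumed notation: $V_t$ denotes the set of nodes active in round $t$, $\pi_t \colon V_t \to V_{t-1}$ is a hypothetical assignment of each active node to a previously active node, $W^{(t)}$ is a doubly stochastic mixing matrix supported on the topology over $V_t$, $\eta$ is the step size, and $\nabla F_i(\cdot\,; \xi_i^{(t)})$ is the stochastic gradient at node $i$.
Each active node $i \in V_t$ first fetches parameters,
\[
  \tilde{x}_i^{(t)} = x_{\pi_t(i)}^{(t)}, \qquad \pi_t(i) \in V_{t-1},
\]
then takes a local SGD step,
\[
  x_i^{(t+1/2)} = \tilde{x}_i^{(t)} - \eta \, \nabla F_i\bigl(\tilde{x}_i^{(t)}; \xi_i^{(t)}\bigr),
\]
and finally averages with the other active nodes,
\[
  x_i^{(t+1)} = \sum_{j \in V_t} W_{ij}^{(t)} \, x_j^{(t+1/2)} .
\]
Here $x_j^{(t)}$ for $j \in V_{t-1}$ denotes the parameters node $j$ holds at the end of round $t-1$.
Under this reading, the mixing matrix acts on only $|V_t|$ nodes rather than on all nodes, which is the sense in which gossip averaging runs on a relatively small topology.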