d116c129f37e6eb1.tex
1: \begin{abstract}
2: Distributed training of GNNs enables learning on massive graphs (e.g., social and e-commerce networks) that exceed the storage and computational capacity of a single machine. 
3: To reach performance comparable to centralized training, distributed frameworks focus on maximally recovering cross-instance node dependencies with either
4: communication across instances or periodic fallback to centralized training,  
5: which create overhead and limit the framework scalability. 
6: In this work, we present a simplified framework for distributed GNN training that does not rely on the aforementioned costly operations, and has improved scalability, convergence speed and performance over the state-of-the-art approaches.
7: Specifically, our framework (1) assembles independent trainers, each of which asynchronously learns a local model on locally-available parts of the training graph, and (2) only conducts periodic (time-based) model aggregation to synchronize the local models. 
8: Backed by our theoretical analysis, instead of maximizing the recovery of cross-instance node dependencies---which has been considered the key behind closing the performance gap between model aggregation and centralized training---, our framework leverages randomized assignment of nodes or super-nodes (i.e., collections of original nodes) to partition the training graph such that it improves data uniformity and  
9: minimizes the discrepancy of gradient and loss function across instances. 
10: In our experiments on {social and e-commerce} networks with up to 1.3 billion edges, our proposed \methodrnd{} and \method{}
11: approaches---despite using \textit{less} training data---achieve state-of-the-art performance and 2.31x speedup compared to the fastest baseline,
12: and show better robustness to trainer failures.
13: 
14: 
15: \end{abstract}
16: