abstract:d116c129f37e6eb1.tex

1: \begin{abstract}

2: Distributed training of GNNs enables learning on massive graphs (e.g., social and e-commerce networks) that exceed the storage and computational capacity of a single machine.

3: To reach performance comparable to centralized training, distributed frameworks focus on maximally recovering cross-instance node dependencies with either

4: communication across instances or periodic fallback to centralized training,

5: which create overhead and limit the framework scalability.

6: In this work, we present a simplified framework for distributed GNN training that does not rely on the aforementioned costly operations, and has improved scalability, convergence speed and performance over the state-of-the-art approaches.

7: Specifically, our framework (1) assembles independent trainers, each of which asynchronously learns a local model on locally-available parts of the training graph, and (2) only conducts periodic (time-based) model aggregation to synchronize the local models.

8: Backed by our theoretical analysis, instead of maximizing the recovery of cross-instance node dependencies---which has been considered the key behind closing the performance gap between model aggregation and centralized training---, our framework leverages randomized assignment of nodes or super-nodes (i.e., collections of original nodes) to partition the training graph such that it improves data uniformity and

9: minimizes the discrepancy of gradient and loss function across instances.

10: In our experiments on {social and e-commerce} networks with up to 1.3 billion edges, our proposed \methodrnd{} and \method{}

11: approaches---despite using \textit{less} training data---achieve state-of-the-art performance and 2.31x speedup compared to the fastest baseline,

12: and show better robustness to trainer failures.

13:

14:

15: \end{abstract}

16: