4583811deb3f10e3.tex
1: \begin{abstract}
2: Training Graph Neural Networks, on graphs containing billions of vertices and edges, at scale using minibatch sampling poses a key challenge: strong-scaling graphs and training examples results in lower compute and higher communication volume and potential performance loss. DistGNN-MB employs a novel {\em Historical Embedding Cache} combined with compute-communication overlap to address this challenge. On a $32$-node ($64$-socket) cluster of 3rd generation Intel\textregistered \hspace{0.01cm} Xeon\textregistered \hspace{0.01cm} Scalable Processors with 36 cores per socket, DistGNN-MB trains $3$-layer GraphSAGE and GAT models on OGBN-Papers100M to convergence with epoch times of $2$ seconds and $4.9$ seconds, respectively, on $32$ compute nodes. At this scale, DistGNN-MB trains GraphSAGE $5.2\times$ faster than the widely-used DistDGL. 
3: DistGNN-MB trains GraphSAGE and GAT $10\times$ and $17.2\times$ faster, respectively, as compute nodes scale from $2$ to $32$.
4: 
5: %In an extreme-scale experiment, we observe that DistGNN-MB takes \revise{EP minutes} per epoch to train a web-graph with $3.5$B vertices and $28$B edges on $128$-node ($256$-socket) cluster.
6: \end{abstract}
7: