DC LG PF | Oct 30, 2024

MassiveGNN: Efficient Training via Prefetching for Massively Connected Distributed Graphs

Aishwarya Sarkar, Sayan Ghosh, Nathan R. Tallent, Ali Jannesari

arXiv:2410.22697v25.15 citations | h-index: 21 | Has Code | CLUSTER

Originality: Incremental advance

AI Analysis

This work addresses performance bottlenecks in distributed GNN training for large-scale graph data, representing an incremental improvement over existing frameworks.

The paper tackles communication overhead and load imbalance in distributed Graph Neural Network training on massively connected graphs by introducing a parameterized continuous prefetch and eviction scheme, achieving about 15-40% improvement in end-to-end training performance for OGB datasets on the Perlmutter supercomputer.

Graph Neural Networks (GNNs) are indispensable in learning from graph-structured data, yet their rising computational costs, especially on massively connected graphs, pose significant challenges for execution performance. To tackle this, distributed-memory solutions are used in practice, such as partitioning the graph to train multiple GNN replicas concurrently. However, approaches that require a partitioned graph usually suffer from communication overhead and load imbalance, even under optimal partitioning and communication strategies, because of irregularities in neighborhood minibatch sampling. This paper proposes practical trade-offs for reducing the sampling and communication overheads of representation learning on distributed graphs (using the popular GraphSAGE architecture) by developing a parameterized continuous prefetch and eviction scheme on top of the state-of-the-art Amazon DistDGL distributed GNN framework, demonstrating about 15-40% improvement in end-to-end training performance on the National Energy Research Scientific Computing Center's (NERSC) Perlmutter supercomputer for various OGB datasets.
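To make the idea of a parameterized prefetch and eviction scheme more concrete, here is a minimal Python sketch. It is not the paper's algorithm and does not use the DistDGL API. The class name `PrefetchBuffer`, the `fetch_remote` callback, and the parameters `capacity`, `evict_every`, `decay`, and `threshold` are all hypothetical, chosen only to illustrate how a local buffer of remote node features might be filled ahead of time and trimmed periodically based on access scores.

```python
from typing import Callable, Dict, Iterable, List


class PrefetchBuffer:
    """Toy local cache of remote node features with periodic score-based eviction.

    All names and parameters are illustrative assumptions, not the paper's design.
    """

    def __init__(
        self,
        fetch_remote: Callable[[int], List[float]],  # stand-in for a remote feature fetch
        capacity: int,           # maximum number of cached remote nodes
        evict_every: int,        # run eviction every N minibatch lookups
        decay: float = 0.5,      # multiplicative score decay applied at eviction time
        threshold: float = 0.25,  # entries scoring below this are evicted
    ) -> None:
        self.fetch_remote = fetch_remote
        self.capacity = capacity
        self.evict_every = evict_every
        self.decay = decay
        self.threshold = threshold
        self.features: Dict[int, List[float]] = {}
        self.scores: Dict[int, float] = {}
        self.step = 0
        self.hits = 0
        self.misses = 0

    def prefetch(self, node_ids: Iterable[int]) -> None:
        """Fetch features for candidate nodes until the buffer is full."""
        for nid in node_ids:
            if len(self.features) >= self.capacity:
                break
            if nid not in self.features:
                self.features[nid] = self.fetch_remote(nid)
                self.scores[nid] = 1.0

    def lookup(self, node_ids: Iterable[int]) -> Dict[int, List[float]]:
        """Serve one minibatch: use cached features on a hit, fetch remotely on a miss."""
        self.step += 1
        out: Dict[int, List[float]] = {}
        for nid in node_ids:
            if nid in self.features:
                self.hits += 1
                self.scores[nid] += 1.0  # reward nodes that are actually sampled
                out[nid] = self.features[nid]
            else:
                self.misses += 1
                out[nid] = self.fetch_remote(nid)
        if self.step % self.evict_every == 0:
            self._evict()
        return out

    def _evict(self) -> None:
        """Decay every score and drop entries that fall below the threshold."""
        for nid in list(self.scores):
            self.scores[nid] *= self.decay
            if self.scores[nid] < self.threshold:
                del self.scores[nid]
                del self.features[nid]
```

In this sketch, a training loop would call `prefetch` with likely-to-be-sampled remote nodes before training and `lookup` once per minibatch. The hit and miss counters give a rough picture of how much remote communication the buffer avoids for a given choice of parameters.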
