DC · LG · Mar 24, 2023

GSplit: Scaling Graph Neural Network Training on Large Graphs via Split-Parallelism

arXiv:2303.13775v3 · 14 citations · h-index: 24
Originality: Incremental advance
AI Analysis

GSplit addresses scalability issues faced by researchers and practitioners who train GNNs on large graphs. It is an incremental improvement over existing parallel training methods.

The paper targets redundant work in data-parallel mini-batch training of graph neural networks (GNNs) on large graphs. It introduces split parallelism, a hybrid parallel paradigm that splits the sampling, loading, and training of each mini-batch across GPUs so that their work does not overlap. The authors report that GSplit outperforms state-of-the-art systems such as DGL, Quiver, and P³.

Graph neural networks (GNNs), an emerging class of machine learning models for graphs, have gained popularity for their superior performance in various graph analytical tasks. Mini-batch training is commonly used to train GNNs on large graphs, and data parallelism is the standard approach to scale mini-batch training across multiple GPUs. Data parallel approaches contain redundant work as subgraphs sampled by different GPUs contain significant overlap. To address this issue, we introduce a hybrid parallel mini-batch training paradigm called split parallelism. Split parallelism avoids redundant work by splitting the sampling, loading, and training of each mini-batch across multiple GPUs. Split parallelism, however, introduces communication overheads that can be more than the savings from removing redundant work. We further present a lightweight partitioning algorithm that probabilistically minimizes these overheads. We implement split parallelism in GSplit and show that it outperforms state-of-the-art mini-batch training systems like DGL, Quiver, and $P^3$.
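The trade-off described in the abstract can be illustrated with a toy count. The following is a minimal sketch, not GSplit's implementation: it assumes full-neighbor k-hop sampling, an adjacency dictionary, and a fixed vertex-to-GPU map, and every name in it (`khop_vertices`, `data_parallel_load`, `split_parallel_load`, `owner`) is hypothetical. It compares how many vertices are loaded when each GPU samples its own micro-batch independently against one cooperative sample in which each vertex belongs to a single GPU, and it uses the number of edges crossing GPUs as a rough proxy for the communication that splitting introduces.

```python
def khop_vertices(adj, targets, hops):
    """Return every vertex within `hops` neighbor steps of `targets` (targets included)."""
    seen = set(targets)
    frontier = set(targets)
    for _ in range(hops):
        nxt = set()
        for v in frontier:
            nxt.update(adj[v])
        frontier = nxt - seen   # only newly discovered vertices expand next round
        seen |= nxt
    return seen


def data_parallel_load(adj, micro_batches, hops):
    """Each GPU samples its own micro-batch independently.

    Vertices shared by several micro-batch subgraphs are counted once per GPU,
    which is the redundant work the abstract refers to.
    """
    return sum(len(khop_vertices(adj, mb, hops)) for mb in micro_batches)


def split_parallel_load(adj, micro_batches, hops, owner):
    """Sample the whole mini-batch once and assign each vertex to one GPU.

    Returns (vertices loaded, directed edges whose endpoints sit on different GPUs).
    The second value is only a rough proxy for communication overhead (assumption).
    """
    targets = [t for mb in micro_batches for t in mb]
    sub = khop_vertices(adj, targets, hops)
    cross = sum(
        1
        for v in sub
        for u in adj[v]
        if u in sub and owner[u] != owner[v]
    )
    return len(sub), cross


# Toy input: a ring of 8 vertices, two GPUs, one target vertex per GPU.
adj = {v: [(v - 1) % 8, (v + 1) % 8] for v in range(8)}
micro_batches = [[0], [1]]
owner = {v: v // 4 for v in range(8)}   # vertices 0-3 on GPU 0, 4-7 on GPU 1

dp_loaded = data_parallel_load(adj, micro_batches, hops=1)
sp_loaded, sp_cross = split_parallel_load(adj, micro_batches, hops=1, owner=owner)
print(dp_loaded, sp_loaded, sp_cross)
```

On this toy ring the split variant loads fewer vertices but pays for some cross-GPU edges. Whether the savings exceed that cost depends on how vertices are assigned to GPUs, which is the role the abstract gives to the partitioning algorithm.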

