GSplit: Scaling Graph Neural Network Training on Large Graphs via Split-Parallelism
GSplit addresses scalability issues for researchers and practitioners who train GNNs on large graphs; it represents an incremental improvement over existing parallel training methods.
The paper tackles redundant work in data-parallel mini-batch training of graph neural networks (GNNs) on large graphs. It introduces split parallelism, a hybrid parallel paradigm that splits the sampling, loading, and training of each mini-batch across GPUs so that no work is duplicated, and shows that GSplit outperforms state-of-the-art systems such as DGL, Quiver, and P³.
Graph neural networks (GNNs), an emerging class of machine learning models for graphs, have gained popularity for their superior performance in various graph analytical tasks. Mini-batch training is commonly used to train GNNs on large graphs, and data parallelism is the standard approach to scale mini-batch training across multiple GPUs. Data-parallel approaches perform redundant work because the subgraphs sampled by different GPUs overlap significantly. To address this issue, we introduce a hybrid parallel mini-batch training paradigm called split parallelism. Split parallelism avoids redundant work by splitting the sampling, loading, and training of each mini-batch across multiple GPUs. Split parallelism, however, introduces communication overheads that can exceed the savings from removing redundant work. We further present a lightweight partitioning algorithm that probabilistically minimizes these overheads. We implement split parallelism in GSplit and show that it outperforms state-of-the-art mini-batch training systems such as DGL, Quiver, and $P^3$.
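
The redundancy described in the abstract can be illustrated with a small sketch. It is not GSplit's implementation: the toy ring graph, the full-neighbor expansion, the function names, and the fixed `owner` rule are all assumptions made for illustration, and the paper's probabilistic partitioning algorithm and the communication between splits are not modeled. The sketch only counts how many vertex copies are loaded when each GPU expands its own mini-batch independently, compared with when the combined mini-batch is expanded once and each vertex is assigned to exactly one GPU.

```python
def k_hop_neighborhood(adj, targets, hops):
    """Return all vertices within `hops` steps of `targets` (full-neighbor expansion)."""
    seen = set(targets)
    frontier = set(targets)
    for _ in range(hops):
        reached = set()
        for v in frontier:
            reached.update(adj[v])
        frontier = reached - seen   # only vertices not visited before
        seen |= frontier
    return seen


def data_parallel_loads(adj, batches, hops):
    """Each GPU expands its own targets independently, so vertices may repeat across GPUs."""
    return [k_hop_neighborhood(adj, batch, hops) for batch in batches]


def split_parallel_loads(adj, batches, hops, owner):
    """Expand the combined mini-batch once; each vertex goes only to the GPU that owns it."""
    all_targets = [v for batch in batches for v in batch]
    needed = k_hop_neighborhood(adj, all_targets, hops)
    loads = [set() for _ in batches]
    for v in needed:
        loads[owner(v)].add(v)
    return loads


# Toy input (assumption): an 8-vertex ring, two GPUs, two target vertices each.
n = 8
adj = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
batches = [[0, 1], [2, 3]]

# Hypothetical static partition: vertices 0-3 belong to GPU 0, vertices 4-7 to GPU 1.
def owner(v):
    return v // 4

dp = data_parallel_loads(adj, batches, hops=2)
sp = split_parallel_loads(adj, batches, hops=2, owner=owner)

dp_total = sum(len(s) for s in dp)   # vertex copies loaded under data parallelism
sp_total = sum(len(s) for s in sp)   # vertex copies loaded under the split scheme
print("data-parallel vertex loads:", dp_total)
print("split-parallel vertex loads:", sp_total)
print("redundant loads avoided:", dp_total - sp_total)
```

On this toy ring the two data-parallel GPUs load 12 vertex copies in total, while the split scheme loads each of the 8 required vertices exactly once. In a real system the saving has to be weighed against the cross-GPU communication that splitting introduces, which is the overhead the abstract says the partitioning algorithm aims to minimize.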