LG PF Nov 25, 2024

DF-GNN: Dynamic Fusion Framework for Attention Graph Neural Networks on GPUs

Jiahui Liu, Zhenkun Cai, Zhiyong Chen, Minjie Wang

arXiv:2411.16127v1, 6.42 citations, h-index: 6, Has Code

Originality: Incremental advance

AI Analysis

This work addresses performance bottlenecks for researchers and practitioners using AT-GNNs on GPUs, offering a significant but incremental optimization over existing systems.

The paper tackles the inefficiency of training Attention Graph Neural Networks (AT-GNNs) on GPUs, which stems from heavy data movement and kernel launch overhead. It proposes DF-GNN, a dynamic kernel fusion framework that achieves speedups of up to 7.0× over the non-fusion DGL sparse library and an average 2.16× speedup in end-to-end training compared to DGL.

Attention Graph Neural Networks (AT-GNNs), such as GAT and Graph Transformer, have demonstrated superior performance compared to other GNNs. However, existing GNN systems struggle to efficiently train AT-GNNs on GPUs due to their intricate computation patterns. The execution of AT-GNN operations without kernel fusion results in heavy data movement and significant kernel launch overhead, while fixed thread scheduling in existing GNN kernel fusion strategies leads to sub-optimal performance, redundant computation and unbalanced workload. To address these challenges, we propose a dynamic kernel fusion framework, DF-GNN, for the AT-GNN family. DF-GNN introduces a dynamic bi-level thread scheduling strategy, enabling flexible adjustments to thread scheduling while retaining the benefits of shared memory within the fused kernel. DF-GNN tailors specific thread scheduling for operations in AT-GNNs and considers the performance bottleneck shift caused by the presence of super nodes. Additionally, DF-GNN is integrated with the PyTorch framework for high programmability. Evaluations across diverse GNN models and multiple datasets reveal that DF-GNN surpasses existing GNN kernel optimization works like cuGraph and dgNN, with speedups up to $7.0\times$ over the state-of-the-art non-fusion DGL sparse library. Moreover, it achieves an average speedup of $2.16\times$ in end-to-end training compared to the popular GNN computing framework DGL.
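To make the data-movement argument concrete, here is a minimal CPU sketch in plain Python, not DF-GNN's implementation. It assumes dot-product attention (as in Graph Transformer) over a graph in CSR form and uses the conventional names SDDMM, edge softmax, and SpMM for the three steps; those names, the function names, and the toy graph are illustrative assumptions, not taken from the abstract. The unfused version materializes a full per-edge array between passes, which on a GPU would correspond to separate kernel launches with global-memory round trips. The fused version keeps each row's intermediates in row-local scratch, loosely analogous to shared memory inside a fused kernel. It does not model thread scheduling at all.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_unfused(indptr, indices, q, k, v):
    """Three separate passes; each materializes a full per-edge array."""
    n, dim = len(indptr) - 1, len(v[0])
    # Pass 1 (SDDMM): one attention score per edge (i <- j).
    scores = [dot(q[i], k[indices[e]])
              for i in range(n) for e in range(indptr[i], indptr[i + 1])]
    # Pass 2 (edge softmax): normalize scores over each row's edges.
    weights = [0.0] * len(scores)
    for i in range(n):
        lo, hi = indptr[i], indptr[i + 1]
        if lo == hi:
            continue  # node without neighbors
        m = max(scores[lo:hi])
        exps = [math.exp(s - m) for s in scores[lo:hi]]
        total = sum(exps)
        for off, x in enumerate(exps):
            weights[lo + off] = x / total
    # Pass 3 (SpMM): weighted sum of neighbor features.
    out = [[0.0] * dim for _ in range(n)]
    for i in range(n):
        for e in range(indptr[i], indptr[i + 1]):
            for d in range(dim):
                out[i][d] += weights[e] * v[indices[e]][d]
    return out

def attention_fused(indptr, indices, q, k, v):
    """One pass per row; per-edge values never leave row-local scratch."""
    n, dim = len(indptr) - 1, len(v[0])
    out = [[0.0] * dim for _ in range(n)]
    for i in range(n):
        nbrs = indices[indptr[i]:indptr[i + 1]]
        if not nbrs:
            continue
        row_scores = [dot(q[i], k[j]) for j in nbrs]
        m = max(row_scores)
        exps = [math.exp(s - m) for s in row_scores]
        total = sum(exps)
        for x, j in zip(exps, nbrs):
            for d in range(dim):
                out[i][d] += (x / total) * v[j][d]
    return out

# Toy graph (hypothetical): node 0 attends to {1, 2}, node 1 to {0}, node 2 to nothing.
indptr, indices = [0, 2, 3, 3], [1, 2, 0]
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
unfused = attention_unfused(indptr, indices, q, k, v)
fused = attention_fused(indptr, indices, q, k, v)
```

Both functions compute the same result; the difference is only in how much intermediate state crosses pass boundaries. A row with very many neighbors (a super node) makes the row-local scratch in the fused version large, which hints at why a single fixed scheduling choice may not suit every row.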
