LG · Jul 14, 2023

DistTGL: Distributed Memory-Based Temporal Graph Neural Network Training

arXiv:2307.07649v1 · 25 citations · h-index: 99
Originality: Incremental advance
AI Analysis

This work addresses scalability issues for researchers and practitioners using temporal graph neural networks in distributed settings, representing an incremental improvement over existing solutions.

The paper tackles the problem of scaling memory-based Temporal Graph Neural Networks (TGNNs) to distributed GPU clusters, where existing frameworks suffer from accuracy loss and node-memory synchronization overhead. It proposes DistTGL, which outperforms the state-of-the-art single-machine method by 14.5% in accuracy and 10.17x in training throughput.

Memory-based Temporal Graph Neural Networks (TGNNs) are powerful tools in dynamic graph representation learning and have demonstrated superior performance in many real-world applications. However, their node memory favors smaller batch sizes to capture more dependencies in graph events and needs to be maintained synchronously across all trainers. As a result, existing frameworks suffer from accuracy loss when scaling to multiple GPUs. Even worse, the tremendous overhead of synchronizing the node memory makes them impractical to deploy on distributed GPU clusters. In this work, we propose DistTGL -- an efficient and scalable solution to train memory-based TGNNs on distributed GPU clusters. DistTGL has three improvements over existing solutions: an enhanced TGNN model, a novel training algorithm, and an optimized system. In experiments, DistTGL achieves near-linear convergence speedup, outperforming the state-of-the-art single-machine method by 14.5% in accuracy and 10.17x in training throughput.
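
To illustrate why node memory favors smaller batch sizes, here is a minimal Python sketch that counts how many events read outdated node memory for a given batch size. It is not DistTGL's code. The function name `count_stale_reads`, the toy event list, and the batching model (memory read once at the start of a batch and written once at the end) are assumptions made for illustration only.

```python
from typing import List, Tuple

# Hypothetical event format: (source node, destination node), in chronological order.
Event = Tuple[int, int]


def count_stale_reads(events: List[Event], batch_size: int) -> int:
    """Count events that read node memory already outdated within their batch.

    Assumption: memory is read once at the start of a batch and written once
    at the end, so an event cannot see updates produced by earlier events in
    the same batch.
    """
    stale = 0
    for start in range(0, len(events), batch_size):
        touched = set()  # nodes involved in earlier events of this batch
        for src, dst in events[start:start + batch_size]:
            if src in touched or dst in touched:
                stale += 1  # this event depends on an update it cannot see
            touched.update((src, dst))
    return stale


# Toy event stream (made up for this example).
events = [(0, 1), (1, 2), (2, 3), (0, 3), (4, 5), (5, 0)]
for batch_size in (1, 2, 3, 6):
    print(batch_size, count_stale_reads(events, batch_size))
```

With a batch size of 1 no event misses a dependency. Once the whole stream is processed as a single batch, most events read memory that earlier events in the same batch should have updated. Under this simplified model, that is the tension between large batches, which multi-GPU training needs, and accuracy.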

