Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration
This work addresses efficiency and resource constraints in large-scale graph embedding applications such as recommendation systems, though its contribution is incremental because it builds on existing heterogeneous computing approaches.
The paper tackles the scalability bottleneck in billion-scale graph embedding systems by proposing Legend, a lightweight heterogeneous system that optimizes CPU-GPU-SSD integration, achieving up to 4.8x speedup over state-of-the-art systems and matching their performance on the largest workloads with only one quarter of the GPUs.
Graph embeddings map graph nodes to continuous vectors and are foundational to community detection, recommendation, and many scientific applications. At billion-scale, however, existing graph embedding systems face a trade-off: they either rely on large in-memory footprints across many GPUs (limited scalability) or repeatedly stream data from disk (incurring severe I/O overhead and low GPU utilization). In this paper, we propose Legend, a lightweight heterogeneous system for graph embedding that systematically redesigns data management across CPU, GPU, and NVMe SSD resources. Legend combines three practical ideas: (1) a prefetch-friendly embedding-loading order that lets GPUs efficiently prefetch necessary embeddings directly from NVMe SSD with low I/O amplification; (2) a high-throughput GPU-SSD direct-access driver tuned for the access patterns of embedding training; and (3) a customized parallel execution strategy that maximizes GPU utilization. Together, these components let Legend store and stream vast embedding data without overprovisioning GPU memory or suffering I/O stalls. Extensive experiments on billion-scale graphs demonstrate that Legend speeds up end-to-end workloads by up to 4.8x versus state-of-the-art systems, and matches their performance on the largest workloads while using only one quarter of the GPUs.
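To make the first idea more concrete, the sketch below shows why the order in which embedding partitions are visited affects how much data must be read from SSD. It is not Legend's actual ordering, which the abstract does not specify. It assumes a common out-of-core setup in which nodes are split into partitions, edges are grouped into buckets by (source partition, destination partition), and a small GPU-side buffer holds a fixed number of partitions with LRU eviction. The names `count_loads`, `row_major_order`, and `snake_order` are hypothetical and chosen only for illustration.

```python
from collections import OrderedDict


def count_loads(order, capacity):
    """Count partition loads for a bucket order, assuming an LRU buffer.

    `order` is a sequence of (src_partition, dst_partition) buckets.
    `capacity` is the number of partitions the buffer can hold (>= 2).
    """
    buffer = OrderedDict()  # partition id -> None, least recently used first
    loads = 0
    for src, dst in order:
        for part in (src, dst):
            if part in buffer:
                buffer.move_to_end(part)  # mark as most recently used
            else:
                loads += 1  # simulated read of one partition from SSD
                buffer[part] = None
                if len(buffer) > capacity:
                    buffer.popitem(last=False)  # evict least recently used
    return loads


def row_major_order(num_parts):
    """Visit buckets row by row, always left to right."""
    return [(i, j) for i in range(num_parts) for j in range(num_parts)]


def snake_order(num_parts):
    """Visit buckets row by row, reversing direction on every other row."""
    order = []
    for i in range(num_parts):
        cols = range(num_parts) if i % 2 == 0 else reversed(range(num_parts))
        order.extend((i, j) for j in cols)
    return order


if __name__ == "__main__":
    parts, capacity = 4, 2
    print("row-major loads:", count_loads(row_major_order(parts), capacity))
    print("snake loads:", count_loads(snake_order(parts), capacity))
```

With four partitions and a two-partition buffer, the snake order needs fewer loads than the row-major order because consecutive buckets share a resident partition more often. A prefetch-friendly order in the paper's sense would additionally need to make upcoming loads predictable so they can overlap with GPU computation, which this toy model does not capture.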