LG, AR · Nov 1, 2021

GNNear: Accelerating Full-Batch Training of Graph Neural Networks with Near-Memory Processing

Zhe Zhou, Cong Li, Xuechao Wei, Xiaoyang Wang, Guangyu Sun

arXiv:2111.00680v29.938 citations

Originality: Incremental advance

AI Analysis

This work addresses the problem of slow and energy-intensive GNN training for researchers and practitioners dealing with large graph data, presenting a hardware solution that is incremental but offers substantial performance gains.

The paper tackles the challenge of inefficient full-batch training for Graph Neural Networks (GNNs) on large graphs by proposing GNNear, an accelerator that uses near-memory processing. It achieves 30.8x/2.5x geomean speedup and 79.6x/7.3x higher energy efficiency over CPU and GPU baselines, respectively.

Recently, Graph Neural Networks (GNNs) have become state-of-the-art algorithms for analyzing non-Euclidean graph data. However, realizing efficient GNN training is challenging, especially on large graphs. The reasons are manifold: 1) GNN training incurs a substantial memory footprint. Full-batch training on large graphs can require hundreds to thousands of gigabytes of memory. 2) GNN training involves both memory-intensive and computation-intensive operations, challenging current CPU/GPU platforms. 3) The irregularity of graphs can result in severe resource under-utilization and load-imbalance problems. This paper presents GNNear, an accelerator that tackles these challenges. GNNear adopts a DIMM-based memory system to provide sufficient memory capacity. To match the heterogeneous nature of GNN training, we offload the memory-intensive Reduce operations to in-DIMM Near-Memory-Engines (NMEs), making full use of the high aggregated local bandwidth. We adopt a Centralized-Acceleration-Engine (CAE) to process the computation-intensive Update operations. We further propose several optimization strategies to deal with the irregularity of input graphs and improve GNNear's performance. Comprehensive evaluations on 16 GNN training tasks demonstrate that GNNear achieves 30.8$\times$/2.5$\times$ geomean speedup and 79.6$\times$/7.3$\times$ (geomean) higher energy efficiency compared to a Xeon E5-2698-v4 CPU and an NVIDIA V100 GPU, respectively.
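
The Reduce/Update split described in the abstract can be illustrated with a minimal pure-Python sketch of one GNN layer. This is not GNNear's implementation: the function names `reduce_phase` and `update_phase`, the sum aggregator, the ReLU activation, and the toy graph are all assumptions chosen for illustration. The sketch only shows why the two phases stress hardware differently: Reduce performs irregular, data-dependent gathers over neighbor lists, while Update is a dense matrix multiply.

```python
# Illustrative only: one GNN layer split into the two phases named in the abstract.
# Names, the sum aggregator, and ReLU are assumptions, not GNNear's actual design.

def reduce_phase(features, neighbors):
    """Memory-intensive phase: sum the feature vectors of each vertex's neighbors."""
    dim = len(features[0])
    aggregated = []
    for v in range(len(features)):
        acc = [0.0] * dim
        for u in neighbors[v]:  # irregular gathers that depend on graph structure
            for k in range(dim):
                acc[k] += features[u][k]
        aggregated.append(acc)
    return aggregated


def update_phase(aggregated, weight):
    """Computation-intensive phase: dense matrix multiply followed by ReLU."""
    out_dim = len(weight[0])
    result = []
    for row in aggregated:
        out = [0.0] * out_dim
        for k, x in enumerate(row):
            for j in range(out_dim):
                out[j] += x * weight[k][j]
        result.append([max(0.0, y) for y in out])
    return result


# Toy graph: 3 vertices, 2 features each, adjacency given as neighbor lists.
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
neighbors = [[1, 2], [0], [0, 1]]
weight = [[1.0, -1.0], [0.5, 1.0]]

hidden = update_phase(reduce_phase(features, neighbors), weight)
```

In this sketch the cost of `reduce_phase` grows with the number of edges and is dominated by memory accesses, whereas `update_phase` has a regular access pattern dominated by multiply-accumulate operations.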
