DR-CircuitGNN: Training Acceleration of Heterogeneous Circuit Graph Neural Network on GPUs
This work addresses a training-performance bottleneck for EDA designers who use HGNNs: it reports kernel-level speedups of up to 3.51x (forward) and 4.09x (backward) with negligible accuracy loss. The contribution is incremental, since it optimizes existing methods rather than introducing a new paradigm.
The paper tackles the high computational cost of training Heterogeneous Graph Neural Networks (HGNNs) on Electronic Design Automation (EDA) circuit graphs. It proposes DR-CircuitGNN, a GPU kernel design that combines a row-wise sparsity-aware Dynamic-ReLU with optimized SpMM kernels, and reports up to 3.51x speedup in forward propagation and 4.09x in backward propagation over state-of-the-art methods.
The growing scale and complexity of integrated circuit design pose increasing challenges for Electronic Design Automation (EDA). Graph Neural Networks (GNNs) have emerged as a promising approach to assist EDA design, as circuits can be naturally represented as graphs. While GNNs offer a foundation for circuit analysis, they often fail to capture the full complexity of EDA designs. Heterogeneous Graph Neural Networks (HGNNs) can better interpret EDA circuit graphs because they capture both topological relationships and geometric features. However, the improved representation capability comes at the cost of even higher computational complexity and processing cost due to their serial module-wise message-passing scheme, which creates a significant performance bottleneck. In this paper, we propose DR-CircuitGNN, a fast GPU kernel design that leverages row-wise sparsity-aware Dynamic-ReLU and optimizes SpMM kernels during heterogeneous message passing to accelerate HGNN training on EDA-related circuit graph datasets. To further enhance performance, we propose a parallel optimization strategy that maximizes CPU-GPU concurrency by processing independent subgraphs concurrently, using multi-threaded CPU initialization and GPU kernel execution via multiple cudaStreams. Our experiments show that on three representative CircuitNet designs (small, medium, large), the proposed method achieves up to 3.51x and 4.09x speedup over the SOTA for forward and backward propagation, respectively. On full-size CircuitNet and sampled Mini-CircuitNet, our parallel design achieves up to 2.71x speedup over the official DGL implementation (cuSPARSE) with negligible impact on correlation scores and error rates.
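To make the ideas in the abstract concrete, here is a minimal CPU-only sketch in plain Python. It is not the authors' GPU kernel. It assumes that "row-wise sparsity-aware Dynamic-ReLU" can be read as keeping at most `k` of the largest positive activations in each feature row, and it uses a thread pool as a stand-in for launching kernels on multiple cudaStreams. The function names, the edge-type names, and the simplification that every subgraph shares one node set are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def dynamic_relu_rowwise(features, k):
    """Assumed Dynamic-ReLU: keep at most k largest positive values per row.

    Returns one sparse row per input row as a list of (column, value)
    pairs sorted by column index.
    """
    sparse_rows = []
    for row in features:
        positive = [(j, v) for j, v in enumerate(row) if v > 0.0]
        # Largest values first; ties go to the lower column index.
        positive.sort(key=lambda t: (-t[1], t[0]))
        sparse_rows.append(sorted(positive[:k]))
    return sparse_rows


def spmm_csr(indptr, indices, values, sparse_rows, num_cols):
    """Multiply a CSR adjacency matrix by row-sparse features.

    Each edge touches at most k feature entries instead of num_cols,
    which is where the row-wise sparsity saves work.
    """
    num_rows = len(indptr) - 1
    out = [[0.0] * num_cols for _ in range(num_rows)]
    for i in range(num_rows):
        for e in range(indptr[i], indptr[i + 1]):
            j, w = indices[e], values[e]
            for c, v in sparse_rows[j]:
                out[i][c] += w * v
    return out


def aggregate_all(subgraphs, sparse_rows, num_cols, workers=2):
    """Run one SpMM per independent subgraph concurrently.

    The threads only illustrate the scheduling idea; on a GPU each task
    would instead be a kernel launch on its own stream.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            name: pool.submit(spmm_csr, *csr, sparse_rows, num_cols)
            for name, csr in subgraphs.items()
        }
        return {name: fut.result() for name, fut in futures.items()}


# Toy data: 3 nodes, 4 feature columns, two hypothetical edge types.
features = [
    [0.5, -1.0, 2.0, 0.1],
    [-0.3, 0.0, 0.7, 0.9],
    [1.5, 1.0, -2.0, 0.2],
]
subgraphs = {
    # name: (indptr, indices, values) in CSR form
    "edge_type_a": ([0, 2, 3, 4], [1, 2, 0, 1], [1.0, 1.0, 1.0, 0.5]),
    "edge_type_b": ([0, 1, 1, 2], [2, 0], [1.0, 2.0]),
}

sparse = dynamic_relu_rowwise(features, k=2)
results = aggregate_all(subgraphs, sparse, num_cols=4)
for name, matrix in results.items():
    print(name, matrix)
```

Because the per-edge-type products do not depend on one another, they can be scheduled in any order or in parallel, and the result matches running `spmm_csr` serially for each subgraph. The sketch says nothing about the actual speedups, which in the paper come from the GPU kernel design itself.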