LG | Mar 25, 2025

PyGraph: Robust Compiler Support for CUDA Graphs in PyTorch

arXiv:2503.19779v1 | 3 citations | h-index: 19
Originality: Incremental advance
AI Analysis

The paper addresses CPU launch-overhead bottlenecks for PyTorch users on NVIDIA GPUs, offering an incremental optimization to PyTorch2's existing compiler support for CUDA Graphs.

The paper tackles the problem that CUDA Graphs often hurt performance in PyTorch because of their static structure and data-copy overheads. It introduces PyGraph, which automatically optimizes and selectively deploys CUDA Graphs within PyTorch2's compilation toolchain, and reports substantial performance improvements over PyTorch2 across various machine learning benchmarks.

CUDA Graphs -- a recent hardware feature introduced for NVIDIA GPUs -- aim to reduce CPU launch overhead by capturing and launching a series of GPU tasks (kernels) as a DAG. However, deploying CUDA Graphs faces several challenges today due to the static structure of a graph. It also incurs performance overhead due to data copy. In fact, we show a counter-intuitive result -- deploying CUDA Graphs hurts performance in many cases. We introduce PyGraph, a novel approach to automatically harness the power of CUDA Graphs within PyTorch2. Driven by three key observations, PyGraph embodies three novel optimizations: it enables wider deployment of CUDA Graphs, reduces GPU kernel parameter copy overheads, and selectively deploys CUDA Graphs based on a cost-benefit analysis. PyGraph seamlessly integrates with PyTorch2's compilation toolchain, enabling efficient use of CUDA Graphs without manual modifications to the code. We evaluate PyGraph across various machine learning benchmarks, demonstrating substantial performance improvements over PyTorch2.
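The cost-benefit idea in the abstract can be illustrated with a toy model. The sketch below is not PyGraph's actual analysis. The `Region` type, the three cost constants, and the helper names are all hypothetical, chosen only to show why capturing kernels as a graph can lose when the data-copy cost outweighs the saved launch overhead.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical candidate: a group of GPU kernels that could be captured as one CUDA Graph."""
    name: str
    num_kernels: int       # kernels launched per invocation
    copy_bytes: int        # bytes copied into the graph's static buffers per replay


# Assumed, made-up constants for illustration; real values would have to be measured.
KERNEL_LAUNCH_US = 5.0     # CPU cost of one ordinary kernel launch
GRAPH_LAUNCH_US = 10.0     # CPU cost of launching one captured graph
COPY_US_PER_KIB = 0.05     # cost of copying 1 KiB of inputs or parameters


def eager_cost_us(region: Region) -> float:
    """Estimated overhead when every kernel is launched individually."""
    return region.num_kernels * KERNEL_LAUNCH_US


def graph_cost_us(region: Region) -> float:
    """Estimated overhead of one graph launch plus the data copy it requires."""
    return GRAPH_LAUNCH_US + (region.copy_bytes / 1024) * COPY_US_PER_KIB


def should_use_graph(region: Region) -> bool:
    """Deploy a graph only when its estimated overhead is lower than eager launches."""
    return graph_cost_us(region) < eager_cost_us(region)


if __name__ == "__main__":
    candidates = [
        Region("many_small_kernels", num_kernels=40, copy_bytes=4 * 1024),
        Region("few_kernels_big_copy", num_kernels=3, copy_bytes=8 * 1024 * 1024),
    ]
    for r in candidates:
        print(r.name, "-> graph" if should_use_graph(r) else "-> eager")
```

Under these assumed numbers, the region with many small kernels benefits from a graph, while the region with few kernels and a large copy does not. This mirrors the abstract's claim that deploying CUDA Graphs can hurt performance in some cases.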

