Multi-GPU Quantum Circuit Simulation and the Impact of Network Performance
This work addresses a performance bottleneck facing quantum computing researchers and developers who rely on classical simulation for algorithm validation and hardware design. The contribution is incremental, however, because it focuses on benchmarking existing interconnect technologies.
The paper tackles the bottleneck of inter-GPU communication in multi-GPU quantum circuit simulation by benchmarking a variety of interconnect technologies. It shows that interconnect advances have yielded over 16X improvements in time to solution, compared with speedups of over 4.5X from improvements in GPU architecture.
Classical simulation of quantum algorithms is notoriously resource-intensive, a difficulty intrinsic to the fundamental goal of quantum computing. Nonetheless, simulation is critical to the success of the field and is required for algorithm development and validation, as well as for hardware design. GPU acceleration has become standard practice for simulation, and because of the exponential scaling inherent in classical methods, multi-GPU simulation can be required to reach representative system sizes. In this case, inter-GPU communication can bottleneck performance. In this work, we introduce MPI into the QED-C Application-Oriented Benchmarks to facilitate benchmarking on HPC systems. We review the advances in interconnect technology and the APIs for multi-GPU communication. We benchmark using a variety of interconnect paths, including the recent NVIDIA Grace Blackwell NVL72 architecture, the first product to extend high-bandwidth GPU-specialized interconnects across multiple nodes. We show that while improvements to GPU architecture have led to speedups of over 4.5X across the last few generations of GPUs, advances in interconnect performance have had a larger impact, with over 16X performance improvements in time to solution for multi-GPU simulations.
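
To make the exponential scaling concrete, the following is a minimal sketch that assumes a full state-vector simulator storing one double-precision complex amplitude (16 bytes) per basis state. The function names, the 80 GiB per-GPU memory figure, and the power-of-two partitioning are illustrative assumptions, not details taken from the paper. The sketch estimates how many GPUs are needed to hold an n-qubit state and which qubits become "global", meaning that a gate acting on them needs amplitudes held on another GPU and therefore triggers inter-GPU communication.

```python
def statevector_bytes(num_qubits: int, bytes_per_amplitude: int = 16) -> int:
    """Memory for a full state vector: 2**n amplitudes (assumes complex128)."""
    return (1 << num_qubits) * bytes_per_amplitude


def min_gpus(num_qubits: int, gpu_memory_bytes: int,
             bytes_per_amplitude: int = 16) -> int:
    """Smallest power-of-two GPU count whose combined memory holds the state.

    Assumption: the state is split evenly and workspace overhead is ignored.
    """
    needed = statevector_bytes(num_qubits, bytes_per_amplitude)
    gpus = 1
    while gpus * gpu_memory_bytes < needed:
        gpus *= 2
    return gpus


def is_global_qubit(qubit: int, num_qubits: int, num_gpus: int) -> bool:
    """True if a gate on `qubit` touches amplitudes on more than one GPU.

    Assumption: num_gpus is a power of two and the highest-order index bits
    select the GPU, so the top log2(num_gpus) qubits are global.
    """
    global_bits = num_gpus.bit_length() - 1  # log2 for a power of two
    local_qubits = num_qubits - global_bits
    return qubit >= local_qubits


if __name__ == "__main__":
    gpu_mem = 80 * 2**30  # hypothetical 80 GiB device
    for n in (30, 34, 38):
        gib = statevector_bytes(n) / 2**30
        print(f"{n} qubits: {gib:.0f} GiB, at least {min_gpus(n, gpu_mem)} GPU(s)")
```

Under these assumptions each additional qubit doubles the memory footprint, so the GPU count, and with it the share of gates that act on global qubits, grows quickly with circuit width. This is why interconnect performance can dominate time to solution.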