Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects
The work offers practical guidance to researchers and developers who use multi-GPU supercomputers, although its contribution is incremental: it benchmarks existing systems rather than proposing new designs.
This paper characterizes GPU-to-GPU communication on three supercomputers (Alps, Leonardo, and LUMI) to evaluate performance and identify bottlenecks, and finds both untapped bandwidth and remaining opportunities for optimization.
Multi-GPU nodes are increasingly common in the rapidly evolving landscape of exascale supercomputers. On these systems, GPUs on the same node are connected through dedicated networks with bandwidths of up to a few terabits per second. However, gauging performance expectations and maximizing system efficiency are challenging because of the variety of technologies, design options, and software layers involved. This paper comprehensively characterizes three supercomputers (Alps, Leonardo, and LUMI), each with a unique architecture and design. We focus on evaluating the performance of intra-node and inter-node interconnects on up to 4096 GPUs, using a mix of benchmarks at both levels. By analyzing the limitations and opportunities of these systems, we aim to offer practical guidance to researchers, system architects, and software developers working on multi-GPU supercomputing. Our results show that bandwidth remains untapped and that there are still many opportunities for optimization, ranging from the network to the software stack.
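Claims of "untapped bandwidth" come down to comparing the bandwidth a benchmark actually achieves with the nominal peak of the link. The summary does not spell out the paper's exact methodology. The sketch below therefore assumes a common ping-pong style measurement, where one round trip moves the message across the link twice, and it uses SI units (1 Gb/s = 10^9 bits per second). The function names and the example numbers (a 1 GB message, a 20 ms round trip, a 1.6 Tb/s peak) are hypothetical illustrations, not values from the paper.

```python
def goodput_gbps(message_bytes: int, seconds: float) -> float:
    """Achieved bandwidth in gigabits per second (SI: 1 Gb = 1e9 bits)."""
    return message_bytes * 8 / seconds / 1e9


def pingpong_bandwidth_gbps(message_bytes: int, round_trip_seconds: float) -> float:
    """One-way bandwidth from a ping-pong test (assumed methodology).

    The message crosses the link twice per round trip, so the one-way
    transfer time is taken to be half of the measured round-trip time.
    """
    return goodput_gbps(message_bytes, round_trip_seconds / 2)


def fraction_of_peak(achieved_gbps: float, peak_gbps: float) -> float:
    """Share of the nominal link bandwidth actually achieved (0.0 to 1.0)."""
    return achieved_gbps / peak_gbps


if __name__ == "__main__":
    # Hypothetical measurement: a 1 GB message with a 20 ms round trip,
    # compared against a hypothetical 1.6 Tb/s (1600 Gb/s) peak.
    bw = pingpong_bandwidth_gbps(1_000_000_000, 0.02)
    print(f"achieved: {bw:.0f} Gb/s, {fraction_of_peak(bw, 1600.0):.0%} of peak")
```

Reporting results as a fraction of peak, rather than only in absolute Gb/s, makes it easier to compare systems whose interconnects have very different nominal bandwidths.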