DC Aug 26, 2024
Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects
Daniele De Sensi, Lorenzo Pichetti, Flavio Vella et al.
Multi-GPU nodes are increasingly common in the rapidly evolving landscape of exascale supercomputers. On these systems, GPUs on the same node are connected through dedicated networks, with bandwidths up to a few terabits per second. However, gauging performance expectations and maximizing system efficiency is challenging due to different technologies, design options, and software layers. This paper comprehensively characterizes three supercomputers (Alps, Leonardo, and LUMI), each with a unique architecture and design. We focus on performance evaluation of intra-node and inter-node interconnects on up to 4096 GPUs, using a mix of intra-node and inter-node benchmarks. By analyzing their limitations and opportunities, we aim to offer practical guidance to researchers, system architects, and software developers dealing with multi-GPU supercomputing. Our results show that there is untapped bandwidth and that many opportunities for optimization remain, ranging from network to software optimization.
86.1 PLASM-PH Mar 17
Accelerating the Particle-In-Cell code ECsim with OpenACC
Elisabetta Boella, Nitin Shukla, Filippo Spiga et al.
The Particle-In-Cell (PIC) method is a computational technique widely used in plasma physics to model plasmas at the kinetic level. In this work, we present our effort to prepare the semi-implicit energy-conserving PIC code ECsim for exascale architectures. To achieve this, we adopted a pragma-based acceleration strategy using OpenACC, which enables high performance while requiring minimal code restructuring. On the pre-exascale Leonardo system, the accelerated code achieves a $5\times$ speedup and a $3\times$ reduction in energy consumption compared to the CPU reference code. Performance comparisons across multiple NVIDIA GPU generations show substantial benefits from the GH200 unified memory architecture. Finally, strong and weak scaling tests on Leonardo demonstrate efficiencies of $70\%$ and $78\%$ on up to 64 and 1024 GPUs, respectively.
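The abstract does not say how the scaling efficiencies are computed. As a minimal sketch, assuming the conventional definitions (strong scaling keeps the total problem size fixed, weak scaling keeps the work per GPU fixed), they could be evaluated as below. The function names and all timings are hypothetical and are not taken from the paper.

```python
def strong_scaling_efficiency(t_base, n_base, t_n, n):
    """Efficiency for a fixed total problem size.

    Ideal run time on n GPUs is t_base * n_base / n, so the efficiency is
    the ratio of that ideal time to the measured time t_n.
    """
    return (t_base * n_base) / (t_n * n)


def weak_scaling_efficiency(t_base, t_n):
    """Efficiency for a fixed problem size per GPU.

    Ideal run time stays equal to t_base as GPUs are added, so the
    efficiency is t_base divided by the measured time t_n.
    """
    return t_base / t_n


if __name__ == "__main__":
    # Hypothetical wall-clock times in seconds, chosen for illustration only.
    strong = strong_scaling_efficiency(t_base=64.0, n_base=1, t_n=2.0, n=64)
    weak = weak_scaling_efficiency(t_base=10.0, t_n=12.5)
    print(f"strong scaling efficiency: {strong:.0%}")
    print(f"weak scaling efficiency:   {weak:.0%}")
```

Under these definitions, an efficiency of 1.0 corresponds to ideal scaling, and values below 1.0 reflect communication, load imbalance, or other overheads.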