DC, Apr 29

MPI Malleability Validation under Replayed Real-World HPC Conditions

arXiv:2604.26576
44.75 citations
AI Analysis

For HPC administrators and users, it provides a validation methodology that addresses skepticism about dynamic resource management feasibility in real-world clusters.

The paper validates MPI malleability under real-world HPC conditions by replaying workload logs on a 125-node partition of MareNostrum 5, showing a 27% reduction in malleable workload time without delaying baseline jobs.

Dynamic Resource Management (DRM) techniques can be leveraged to maximize throughput and resource utilization in computational clusters. Although DRM has been extensively studied through analytical workloads and simulations, skepticism persists among administrators and end users regarding its feasibility under real-world conditions. To address this problem, we propose a novel methodology for validating DRM techniques, such as malleability, in realistic scenarios that reproduce actual cluster conditions of jobs and users by replaying workload logs on a High-Performance Computing (HPC) infrastructure. Our methodology is capable of adapting the workload to the target cluster. We evaluate our methodology in a malleability-enabled 125-node partition of the MareNostrum 5 supercomputer. Our results validate the proposed method and assess the benefits of MPI malleability on a novel use case of a pioneer user of malleability (our "PhD Student"): parallel efficiency-aware malleability reduced the malleable workload's time by 27% without delaying the baseline workload; it introduced queueing delays for individual jobs but maintained the resource utilization rate.
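
As a rough illustration of what "parallel efficiency-aware" resizing could mean, the sketch below computes the standard parallel efficiency E(p) = T(1) / (p * T(p)) from measured runtimes and picks the largest node count whose efficiency stays at or above a threshold. The function names, the 0.7 threshold, and the sample runtimes are assumptions made for illustration; the abstract does not describe the paper's actual policy.

```python
def parallel_efficiency(t1, tp, p):
    """Standard parallel efficiency: speedup T(1)/T(p) divided by p."""
    return t1 / (p * tp)


def largest_efficient_size(runtimes, threshold=0.7):
    """Return the largest node count whose efficiency is >= threshold.

    `runtimes` maps node count -> measured runtime and must contain 1.
    Hypothetical helper for illustration, not the paper's policy.
    """
    t1 = runtimes[1]
    efficient = [
        p for p, tp in runtimes.items()
        if parallel_efficiency(t1, tp, p) >= threshold
    ]
    return max(efficient)


# Made-up runtimes in seconds, keyed by node count (illustration only).
sample = {1: 100.0, 2: 52.0, 4: 28.0, 8: 20.0}
best = largest_efficient_size(sample)
```

With these made-up numbers, growing from 4 to 8 nodes drops efficiency below the threshold, so a policy of this kind would stop at 4 nodes and leave the remaining nodes to other jobs.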
