DC · AR · Apr 7

Fine-Grained Power and Energy Attribution on AMD GPU/APU-Based Exascale Nodes

arXiv:2604.06056 · 43.7
Predicted impact: top 36% in DC (last 90 days) · Originality: Incremental advance
AI Analysis

This work supports power-aware optimization on exascale computing systems and offers portable guidance for sensor validation. It is incremental, however, because it builds on existing tools and targets specific hardware.

The paper tackles the challenge of accurately attributing power and energy use to short-lived accelerator activity on exascale GPU/APU systems. It develops a methodology to characterize and correct sensor discrepancies and applies it to rocHPL, rocHPL-MxP, and HPG-MxP. The corrected measurements show that mixed precision reduces node energy on Frontier by 79% for rocHPL-MxP and 31% for HPG-MxP, with similar trends on Portage.

Modern exascale GPU- and APU-based systems provide multiple power and energy sensors, but differences in scope, update rate, timing, and filtering complicate the attribution of short-lived accelerator activity. This paper presents a methodology to characterize and correct these effects on Cray EX systems with AMD Instinct MI250X GPUs (Frontier) and MI300A APUs (Portage). Using controlled square-wave workloads, we quantify update intervals, delay, aliasing, and variability across up to 512 GPUs and 480 APUs with on-chip (rocm-smi/amd-smi) and off-chip Cray Power Management sensors. We reconstruct power from cumulative energy counters to achieve faster response times, validate it against on-chip, off-chip, and node-level sensors, and integrate the resulting streams into a Score-P/PAPI-based tool for time-aligned, phase-level attribution. Applied to rocHPL, rocHPL-MxP, and HPG-MxP, the method separates energy savings due to reduced runtime from changes in power. Mixed precision reduces node energy on Frontier by 79% for rocHPL-MxP and 31% for HPG-MxP, with similar trends on Portage. These results provide portable guidance for sensor validation and power-aware optimization on current and future exascale systems.
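The abstract's idea of reconstructing power from cumulative energy counters can be illustrated with a finite difference over successive counter readings. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `power_from_energy`, its parameters, the units (seconds and joules), and the single-wraparound handling are all hypothetical choices made for illustration.

```python
from typing import List, Optional, Sequence, Tuple


def power_from_energy(
    timestamps_s: Sequence[float],
    energy_j: Sequence[float],
    counter_max_j: Optional[float] = None,
) -> List[Tuple[float, float]]:
    """Estimate average power between successive cumulative energy readings.

    Returns (interval midpoint in seconds, average power in watts) pairs.
    Hypothetical helper: names and units are assumptions, not from the paper.
    """
    if len(timestamps_s) != len(energy_j):
        raise ValueError("timestamps and energy readings must have equal length")

    samples: List[Tuple[float, float]] = []
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        if dt <= 0:
            # Skip repeated or out-of-order timestamps; they carry no rate information.
            continue
        de = energy_j[i] - energy_j[i - 1]
        if de < 0 and counter_max_j is not None:
            # Assumption: the counter wrapped around exactly once in this interval.
            de += counter_max_j
        midpoint = 0.5 * (timestamps_s[i] + timestamps_s[i - 1])
        samples.append((midpoint, de / dt))
    return samples


if __name__ == "__main__":
    # Synthetic square wave: 100 W for 1 s, then 500 W for 1 s, sampled every 0.5 s.
    t = [0.0, 0.5, 1.0, 1.5, 2.0]
    e = [0.0, 50.0, 100.0, 350.0, 600.0]
    for midpoint, watts in power_from_energy(t, e):
        print(f"t={midpoint:.2f} s  P={watts:.1f} W")
```

Because each estimate is an average over one sampling interval, it is attributed to the interval midpoint. In this sketch the time resolution is bounded by the counter's update interval, and no filtering or delay correction of the kind the paper characterizes is applied.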

