Wattchmen: Watching the Wattchers -- High Fidelity, Flexible GPU Energy Modeling
The work addresses energy constraints in GPU-rich HPC systems by providing a more accurate and flexible tool for energy modeling, though the contribution is incremental: it builds on prior work such as AccelWattch and Guser.
The paper tackles the problem of inaccurate and inflexible GPU energy attribution in HPC systems by proposing Wattchmen, a methodology that reduces mean absolute percent error (MAPE) to 14% on V100 GPUs and enables energy reductions of up to 35% in applications like Backprop and QMCPACK.
Modern GPU-rich HPC systems are becoming increasingly energy-constrained, so understanding an application's energy consumption is essential. Unfortunately, current GPU energy attribution techniques are inaccurate, inflexible, or outdated. We therefore propose Wattchmen, a flexible methodology for measuring, attributing, and predicting GPU energy consumption. We construct a per-instruction energy model using a diverse set of microbenchmarks to systematically quantify the energy consumption of GPU instructions, enabling finer-grained prediction and energy consumption breakdowns for applications. Across 16 popular GPGPU, graph analytics, HPC, and ML workloads, Wattchmen reduces the mean absolute percent error (MAPE) to 14% on V100 GPUs, compared with 32% for AccelWattch and 25% for Guser, the state-of-the-art systems. Furthermore, we show that Wattchmen provides similar MAPEs for water-cooled V100s (15%) and extends to later architectures, including air-cooled A100 (11%) and H100 (12%) GPUs. Finally, to further demonstrate Wattchmen's value, we apply it to applications such as Backprop and QMCPACK, where Wattchmen's insights enable energy reductions of up to 35%.
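To make the two quantities in the abstract concrete, the following is a minimal sketch of how a per-instruction energy model could produce a prediction and how MAPE is computed against measurements. It is not Wattchmen's implementation: the function names, the opcode labels, the optional constant-power term, and all numeric values are illustrative assumptions, not figures from the paper.

```python
from typing import Dict, Sequence


def predict_energy(
    instr_counts: Dict[str, int],
    energy_per_instr_j: Dict[str, float],
    constant_power_w: float = 0.0,
    runtime_s: float = 0.0,
) -> float:
    """Predict energy (joules) as a sum of per-instruction costs.

    Assumption: each executed instruction of a given opcode costs a fixed
    amount of energy, and an optional constant-power term covers energy
    that does not scale with instruction count.
    """
    dynamic_j = sum(
        count * energy_per_instr_j[opcode]
        for opcode, count in instr_counts.items()
    )
    return dynamic_j + constant_power_w * runtime_s


def mape(measured: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percent error, in percent, over paired values."""
    if not measured or len(measured) != len(predicted):
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(abs(p - m) / abs(m) for m, p in zip(measured, predicted))
    return 100.0 * total / len(measured)


if __name__ == "__main__":
    # Hypothetical per-instruction energies and counts, for illustration only.
    energies = {"FADD": 0.5, "LDG": 2.0}
    counts = {"FADD": 1000, "LDG": 100}
    print(predict_energy(counts, energies, constant_power_w=10.0, runtime_s=3.0))
    print(mape([100.0, 200.0], [110.0, 180.0]))
```

The per-opcode breakdown inside `predict_energy` is also what would allow an energy breakdown by instruction type: the individual products can be reported separately instead of being summed.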