LG | AI | DC | Aug 12, 2022

Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training

arXiv:2208.06102v2 | 168 citations | h-index: 41
Originality: Incremental advance
AI Analysis

This work addresses the GPU energy consumption of DNN training, a concern for machine learning researchers and practitioners. It offers a method for balancing energy use against training performance, though it builds on existing optimization concepts.

The paper tackles inefficient energy usage in deep neural network (DNN) training by proposing Zeus, an optimization framework that automatically finds job- and GPU-level configurations for recurring training jobs, improving energy efficiency by 15.3% to 75.8% across diverse workloads.

Training deep neural networks (DNNs) is becoming increasingly more resource- and energy-intensive every year. Unfortunately, existing works primarily focus on optimizing DNN training for faster completion, often without considering the impact on energy efficiency. In this paper, we observe that common practices to improve training performance can often lead to inefficient energy usage. More importantly, we demonstrate that there is a tradeoff between energy consumption and performance optimization. To this end, we propose Zeus, an optimization framework to navigate this tradeoff by automatically finding optimal job- and GPU-level configurations for recurring DNN training jobs. Zeus uses an online exploration-exploitation approach in conjunction with just-in-time energy profiling, averting the need for expensive offline measurements, while adapting to data drifts over time. Our evaluation shows that Zeus can improve the energy efficiency of DNN training by 15.3%-75.8% for diverse workloads.
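
To make the exploration-exploitation idea in the abstract concrete, here is a minimal sketch of how a recurring job could pick a job-level knob (batch size) and a GPU-level knob (power limit) online. It is an illustration under stated assumptions, not Zeus's actual algorithm or API. The weighted cost form, the epsilon-greedy policy, the candidate values, and the `simulated_job` function (a toy stand-in for a real training run with energy measurement) are all hypothetical.

```python
import random
from dataclasses import dataclass, field

MAX_POWER_W = 300  # assumed maximum GPU power limit, for illustration only


@dataclass
class Config:
    batch_size: int       # job-level knob
    power_limit_w: int    # GPU-level knob
    costs: list = field(default_factory=list)  # costs observed in past runs

    def mean_cost(self) -> float:
        return sum(self.costs) / len(self.costs)


def cost(energy_j: float, time_s: float, eta: float, max_power_w: float) -> float:
    """Assumed cost: eta=1 counts only energy, eta=0 counts only time.

    Time is scaled by max power so both terms are in joules.
    """
    return eta * energy_j + (1.0 - eta) * max_power_w * time_s


def simulated_job(cfg: Config, rng: random.Random):
    """Hypothetical stand-in for one training run plus energy measurement."""
    slowdown = (MAX_POWER_W / cfg.power_limit_w) ** 0.5   # lower cap, slower run
    batch_penalty = 1.0 + abs(cfg.batch_size - 64) / 256  # toy sweet spot at 64
    time_s = 1000.0 * slowdown * batch_penalty * rng.uniform(0.97, 1.03)
    energy_j = 0.9 * cfg.power_limit_w * time_s
    return energy_j, time_s


def choose(configs, epsilon: float, rng: random.Random) -> Config:
    """Epsilon-greedy: try each config once, then mostly exploit the cheapest."""
    for cfg in configs:
        if not cfg.costs:
            return cfg
    if rng.random() < epsilon:
        return rng.choice(configs)
    return min(configs, key=lambda c: c.mean_cost())


def optimize(eta: float, rounds: int = 40, epsilon: float = 0.1, seed: int = 0) -> Config:
    """Run `rounds` recurrences of the job and return the cheapest config seen."""
    rng = random.Random(seed)
    configs = [Config(b, p) for b in (32, 64, 128) for p in (150, 225, 300)]
    for _ in range(rounds):
        cfg = choose(configs, epsilon, rng)
        energy_j, time_s = simulated_job(cfg, rng)
        cfg.costs.append(cost(energy_j, time_s, eta, MAX_POWER_W))
    tried = [c for c in configs if c.costs]
    return min(tried, key=lambda c: c.mean_cost())


if __name__ == "__main__":
    for eta in (0.0, 0.5, 1.0):
        best = optimize(eta)
        print(eta, best.batch_size, best.power_limit_w)
```

In this toy model, an energy-only objective (`eta=1.0`) favors the lowest power limit, while a time-only objective (`eta=0.0`) favors the highest, which mirrors the energy-performance tradeoff the abstract describes. Each measurement comes from a real recurrence of the job rather than a separate offline sweep, which is the point of the online approach.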
