Kareus: Joint Reduction of Dynamic and Static Energy in Large Model Training
This work addresses the energy-efficiency bottleneck in AI training, which matters for both cost and environmental impact. It is an incremental improvement over prior optimizations that each target only one aspect of energy consumption.
The paper tackles the high energy consumption of large model training by jointly optimizing dynamic and static energy through fine-grained kernel scheduling and frequency scaling. Compared to the state of the art, it reports up to 28.3% lower energy at the same training time, or up to 27.5% shorter training time at the same energy consumption.
The computing demand of AI is growing at an unprecedented rate, but energy supply is not keeping pace. As a result, energy has become an expensive, contended resource that requires explicit management and optimization. Although recent works have made significant progress in large model training optimization, they focus only on a single aspect of energy consumption: dynamic or static energy. We find that fine-grained kernel scheduling and frequency scaling jointly and interdependently impact both dynamic and static energy consumption. Based on this finding, we design Kareus, a training system that pushes the time-energy tradeoff frontier by optimizing both aspects. Kareus decomposes the intractable joint optimization problem into local, partition-based subproblems. It then uses a multi-pass multi-objective optimization algorithm to find execution schedules that push the time-energy tradeoff frontier. Compared to the state of the art, Kareus reduces training energy by up to 28.3% at the same training time, or reduces training time by up to 27.5% at the same energy consumption.
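To make the notion of a time-energy tradeoff frontier concrete, the following is a minimal sketch and not Kareus's algorithm. It assumes a toy model in which dynamic power grows with the cube of clock frequency and static power is constant; the function names (`toy_time_energy`, `pareto_frontier`) and all constants are hypothetical, chosen only for illustration. The sketch shows why the two energy components interact: lowering frequency cuts dynamic energy but lengthens the run, so static energy accumulates for longer.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (time in seconds, energy in joules)


def toy_time_energy(freq_ghz: float,
                    work: float = 100.0,      # hypothetical amount of work
                    c_dyn: float = 50.0,      # hypothetical dynamic-power coefficient
                    p_static: float = 100.0   # hypothetical static power in watts
                    ) -> Point:
    """Toy model (an assumption, not from the paper):
    time = work / f, dynamic power = c_dyn * f^3, static power = constant."""
    time_s = work / freq_ghz
    dynamic_j = c_dyn * freq_ghz ** 3 * time_s
    static_j = p_static * time_s
    return time_s, dynamic_j + static_j


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Keep only points that no other point beats in both time and energy."""
    frontier: List[Point] = []
    best_energy = float("inf")
    # Sort by time (then energy); a point survives only if it uses
    # strictly less energy than every faster point seen so far.
    for time_s, energy_j in sorted(set(points)):
        if energy_j < best_energy:
            frontier.append((time_s, energy_j))
            best_energy = energy_j
    return frontier


freqs = [0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0]
candidates = [toy_time_energy(f) for f in freqs]
frontier = pareto_frontier(candidates)

for time_s, energy_j in frontier:
    print(f"time={time_s:7.2f} s  energy={energy_j:9.1f} J")
```

Under these assumed constants, the 0.8 GHz configuration is both slower and more energy-hungry than the 1.0 GHz one, because the extra static energy outweighs the dynamic savings, so it falls off the frontier. A system that "pushes the frontier" finds schedules whose (time, energy) points dominate those of existing approaches.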