Dillon Jensen

1paper

1 Paper

54.2ARApr 16
EasyRider: Mitigating Power Transients in Datacenter-Scale Training Workloads

Dillon Jensen, Obi Nnorom, Grant Wilkins et al.

Large-scale AI model training workloads use thousands of GPUs operating in tightly synchronized loops. During synchronous communication, start-up, shut-down, and checkpointing, GPU power consumption can swing from peak to idle within milliseconds. These large and rapid load swings endanger grid infrastructure as they induce steep power ramp rates, voltage and frequency shifts, and reactive power transients that can damage transformers, converters, and protection equipment. To solve this problem, we introduce EasyRider, a power architecture to mitigate power fluctuations at the rack level. EasyRider uses passive components and actively-controlled auxiliary energy storage to attenuate rack power swings. A software system continually monitors the energy storage system to maximize its lifetime in the presence of frequent charge/discharge cycles. EasyRider filters rack power variations to be within grid safety requirements without requiring software modifications to AI training frameworks or wasting energy. We evaluate EasyRider on a 400VDC-rated prototype system against published workload traces and our own GPU testbed, demonstrating its effectiveness across heterogeneous power levels and workload power profiles.