Provisioning to Runtime Optimization of a +100 MW AI Cluster
This work addresses the critical bottleneck of electric power supply for AI data centers and offers practical insights for operators of large-scale AI clusters.
This paper presents the first end-to-end power management process for a hyper-scale AI datacenter, from planning to runtime optimization, using detailed power measurements from a 150 MW cluster with 83K GB200 GPUs.
The electric power supply for AI data centers is now the most significant bottleneck in the race toward Artificial General Intelligence, surpassing even the constraint of AI accelerator availability. To our knowledge, this paper is the first to describe the end-to-end power management process for a hyper-scale AI datacenter: from early power planning to accommodate next-generation accelerators 6 to 12 months before their general availability, to tuning power settings after large-scale deployment, and finally to dynamic runtime power management for evolving workloads. We present detailed power measurements for a 150 MW datacenter hosting a cluster of 83K GB200 GPUs and share insights from building this state-of-the-art AI cluster. We hope this work encourages practitioners across the industry to share their own experiences as well.