SYSYAPMay 15

Co-Design Optimization for Data Center Cooling System via Digital Twin

arXiv:2605.1551651.7
AI Analysis

For operators of liquid-cooled exascale supercomputers, this work provides a systematic method to optimize cooling plant design and operation, yielding significant energy savings with minimal hardware changes.

This paper presents a three-layer optimization framework for co-designing data center cooling systems, achieving 35.48% annual cooling energy savings on the Frontier exascale supercomputer, only 0.18% above the current design, and reducing design sensitivity by 93% via flow fraction optimization.

Liquid-cooled exascale supercomputers dissipate heat through cooling plants organized as multiple parallel subloops, but how to allocate coolant distribution units (CDUs) across subloops and how to distribute flow among them has not been systematically addressed for facilities at this scale. This paper presents a three-layer optimization framework that jointly determines the integer partition of CDUs across subloops, the continuous flow fraction allocation, and the per-timestep co-design optimization of total flow rate and supply temperature subject to per-subloop thermal safety constraints. The Modelica simulation model is built based on the data of Frontier exascale supercomputer at Oak Ridge National Laboratory. By developing a reduced-order surrogate model, all 611 feasible partitions of 25 CDUs are evaluated across the full year operational dataset of 49,353 timesteps. Three progressively richer operational strategies are compared, ranging from flow control optimization to full three-layer co-design optimization with dynamically adjusted flow fractions. The globally optimal design is a two-subloop plant achieving 35.48% annual cooling energy savings, only 0.18% above the current three-subloop Frontier design at 35.30%. Flow fraction optimization is shown to compensate for any feasible CDU-to-subloop assignment, reducing the design sensitivity by 93% and providing a low-cost software-only pathway to near-optimal performance on the existing Frontier hardware. The framework is transferable to other liquid-cooled high-performance computing plants.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes