Attention-Informed Surrogates for Navigating Power-Performance Trade-offs in HPC
This work addresses the challenge of optimizing node allocation for HPC schedulers, offering an incremental improvement through embedding-informed surrogates.
The paper tackles the problem of balancing performance and power in HPC scheduling by developing a surrogate-assisted multi-objective Bayesian optimization (MOBO) framework, which consistently identified higher-quality Pareto fronts than baselines on two production datasets while reducing training costs.
High-Performance Computing (HPC) schedulers must balance user performance against facility-wide resource constraints. In practice, the task reduces to selecting the number of nodes to allocate to a given job. We present a surrogate-assisted multi-objective Bayesian optimization (MOBO) framework that automates this decision. Our core hypothesis is that surrogate models informed by attention-based embeddings of job telemetry capture performance dynamics more effectively than standard regression techniques. We pair these surrogates with a sample acquisition strategy designed to keep the approach data-efficient. On two production HPC datasets, our embedding-informed method consistently identified higher-quality Pareto fronts of runtime-power trade-offs than the baselines. The sampling strategy also reduced training costs while improving the stability of the results. To our knowledge, this is the first work to apply embedding-informed surrogates in a MOBO framework to the HPC scheduling problem, jointly optimizing for performance and power on production workloads.
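To make the notion of a runtime-power Pareto front concrete, the following minimal sketch filters a set of candidate node counts down to the non-dominated ones, assuming both runtime and power are to be minimized. The function name `pareto_front` and all numeric values are hypothetical and chosen purely for illustration; they are not taken from the paper's datasets, and the sketch does not reproduce the paper's surrogate models or acquisition strategy.

```python
from typing import List, Tuple

# Each candidate is (node_count, predicted_runtime, predicted_power).
Candidate = Tuple[int, float, float]


def pareto_front(points: List[Candidate]) -> List[Candidate]:
    """Return the candidates not dominated in (runtime, power).

    A candidate is dominated if another candidate is no worse in both
    objectives and strictly better in at least one (both minimized).
    """
    front = []
    for i, (_, r_i, p_i) in enumerate(points):
        dominated = False
        for j, (_, r_j, p_j) in enumerate(points):
            if j == i:
                continue
            no_worse = r_j <= r_i and p_j <= p_i
            strictly_better = r_j < r_i or p_j < p_i
            if no_worse and strictly_better:
                dominated = True
                break
        if not dominated:
            front.append(points[i])
    return front


# Hypothetical predictions for one job (illustrative units only).
candidates = [
    (1, 100.0, 0.5),
    (2, 55.0, 1.0),
    (4, 30.0, 2.0),
    (8, 32.0, 4.0),   # slower and more power-hungry than 4 nodes
    (16, 20.0, 8.0),
]

front = pareto_front(candidates)
print([nodes for nodes, _, _ in front])
```

In this toy setting the 8-node option is excluded because the 4-node option is better on both objectives; the remaining node counts each represent a different trade-off that a scheduler could choose between.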