DC · PF · Mar 29

Operational Strategies for Non-Disruptive Scheduling Transitions in Production HPC Systems

arXiv:2603.27863 · 6.9 · h-index: 11
Predicted impact top 58% in DC · last 90 days · Originality: Synthesis-oriented
AI Analysis

For HPC system administrators, this paper provides an operational strategy for migrating scheduling policies mid-lifecycle without disrupting user workflows.

This paper presents a case study of transitioning a production HPC cluster from node-exclusive to consumable resource scheduling without disrupting active workloads. After the transition, median queue wait times fell from 277 minutes to under 3 minutes for CPU workloads and from 81 minutes to 3.4 minutes for GPU workloads.

Migrating heterogeneous high-performance computing (HPC) systems to resource-aware scheduling introduces both technical and behavioral challenges, particularly in production environments with established user workflows. This paper presents a case study of transitioning a production academic HPC cluster from node-exclusive to consumable resource scheduling mid-lifecycle, without disrupting active workloads. We describe an operational strategy combining a time-bounded compatibility layer, observability-driven feedback, and targeted user engagement to guide adoption of explicit resource declaration. This approach protected active research workflows throughout the transition, avoiding the disruption that a direct cut-over would have imposed on the user community. Following deployment, median queue wait times fell from 277 minutes to under 3 minutes for CPU workloads and from 81 minutes to 3.4 minutes for GPU workloads. Users who adopted TRES-based submission exhibited strong long-term retention. These results demonstrate that successful scheduling transitions depend not only on system configuration, but on aligning observability, user engagement, and operational design.

