Scalable Verification of GNN-based Job Schedulers
This work addresses verification challenges for GNN-based job schedulers, an incremental step toward ensuring the reliability of cluster management systems.
The paper tackles the problem of verifying that GNN-based job schedulers satisfy properties such as strategy-proofness and stability. It develops the vegas framework, which achieves a significant speed-up in verification compared to previous methods.
Recently, Graph Neural Networks (GNNs) have been applied to scheduling jobs over clusters, achieving better performance than hand-crafted heuristics. Despite their impressive performance, concerns remain over whether these GNN-based job schedulers meet users' expectations about other important properties, such as strategy-proofness, sharing incentive, and stability. In this work, we consider formal verification of GNN-based job schedulers. We address several domain-specific challenges, such as networks that are deeper and specifications that are richer than those encountered when verifying image and NLP classifiers. We develop vegas, the first general framework for verifying both single-step and multi-step properties of these schedulers, based on carefully designed algorithms that combine abstractions, refinements, solvers, and proof transfer. Our experimental results show that vegas achieves a significant speed-up over previous methods when verifying important properties of a state-of-the-art GNN-based scheduler.