Laminar: A Probe-First Scheduling Paradigm with Deterministic Runtime Survival
For exascale GPU cluster operators, Laminar addresses the fragmentation and runtime survival challenges of mixed long-resident and transient workloads, offering a scalable decentralized alternative to centralized schedulers.
Laminar introduces a decentralized probe-first scheduling paradigm for exascale GPU clusters that keeps hot-path control-plane work near $\mathcal{O}(1)$ and adds a deterministic node-local runtime-survival layer (Airlock) that handles severe memory pressure through ordered suspension, recovery, and reclamation. It enables lifecycle-aware scheduling that preserves high-value long-resident workloads and operates closer to physical saturation.
In exascale-oriented GPU clusters, rigid-topology jobs leave behind a fragmented post-landing ecology in which long-resident workloads and highly transient tasks compete for unstable residual capacity. Existing centralized, hierarchical, and local-first decentralized schedulers incur growing coordination and retry-amplification costs in this regime and typically stop their explicit responsibility at execution start, leaving runtime survival to indiscriminate host-level OOM heuristics. We present Laminar, a decentralized probe-first, execute-later scheduling paradigm that keeps hot-path control-plane work near $\mathcal{O}(1)$ through Zone-level probabilistic flow splitting, bounded in-Zone probing by persistent lightweight agents, and node-local arbitration. Laminar further introduces Airlock, a bounded node-local runtime-survival layer that converts severe memory pressure into an ordered policy of suspension, in-situ recovery, bounded secondary re-addressing, or reclamation. By enforcing priority-ordered survival under pressure, Laminar enables lifecycle-aware scheduling that preserves high-value long-resident work and operates closer to physical saturation without relying on protocol-level overcommitment.
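As a rough illustration of the flow described above, the following Python sketch walks one task through Zone-level probabilistic splitting, bounded in-Zone probing, and node-local arbitration, and then shows one possible reading of Airlock's ordered policy for a task that has already been suspended. It is a minimal sketch under stated assumptions, not Laminar's implementation: the names `Node`, `Zone`, `place`, `airlock_step`, the `weight` field, and the parameters `probe_budget` and `max_readdress` are all hypothetical, and memory in GB stands in for whatever resource model the system actually uses.

```python
import random
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_gb: float

    def try_admit(self, demand_gb: float) -> bool:
        # Node-local arbitration: the node decides from its own state only.
        if demand_gb <= self.free_gb:
            self.free_gb -= demand_gb
            return True
        return False


@dataclass
class Zone:
    name: str
    weight: float  # hypothetical routing weight, e.g. advertised slack
    nodes: list


def place(demand_gb, zones, probe_budget=3, rng=random):
    """Return (zone, node) names on success, or None if every probe is refused."""
    # 1. Zone-level probabilistic flow splitting: one weighted random draw.
    zone = rng.choices(zones, weights=[z.weight for z in zones], k=1)[0]
    # 2. Bounded in-Zone probing: at most probe_budget nodes are contacted,
    #    so the per-task probing cost is capped regardless of Zone size.
    for node in rng.sample(zone.nodes, k=min(probe_budget, len(zone.nodes))):
        # 3. Execute later: memory is committed only when a probe is accepted.
        if node.try_admit(demand_gb):
            return zone.name, node.name
    return None


def airlock_step(task_gb, local_free_gb, readdress_used, max_readdress=1):
    """Assumed next step for a task already suspended under memory pressure."""
    if task_gb <= local_free_gb:
        return "recover_in_situ"  # pressure has eased on the same node
    if readdress_used < max_readdress:
        return "readdress"  # bounded secondary re-addressing
    return "reclaim"  # last resort once the re-addressing budget is spent
```

For example, with a Zone `a` of weight 1.0 holding nodes with 8 GB and 80 GB free and a Zone `b` of weight 0.0, `place(40.0, zones, probe_budget=2)` always lands on the 80 GB node, because the draw can only pick `a` and both of its nodes are probed. The cap on probes and on re-addressing attempts is what keeps the work per task bounded in this sketch. How Laminar actually sets Zone weights and survival priorities is not specified in the abstract.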