DCJun 5

Clairvoyant: Predictive SJF Scheduling to Mitigate Head-of-Line Blocking in Serial LLM Backends

arXiv:2606.072489.0Has Code

Originality Incremental advance

AI Analysis

For edge and local LLM deployments with memory constraints, Clairvoyant provides a practical solution to reduce latency for short queries without requiring backend modifications.

Clairvoyant mitigates head-of-line blocking in serial LLM backends by predicting response length with a lightweight XGBoost classifier, achieving 70-76% P50 latency reduction for short requests under high queue pressure and 17% under steady-state load.

Serial LLM inference backends -- such as Ollama -- process requests one at a time under FCFS admission, causing Head-of-Line Blocking (HOLB) under mixed workloads at high utilisation: short factual queries can be delayed by minutes behind long generation jobs. While cloud-scale deployments mitigate HOLB via continuous batching (vLLM, Orca), these solutions require tens of GB of VRAM for concurrent KV-caches -- infeasible for memory-constrained edge and local deployments that rely on serial request dispatch. We present \clairvoyant, a drop-in sidecar proxy for any serial OpenAI-compatible backend (e.g., Ollama, llama.cpp). \clairvoyant predicts response length from 19 lightweight lexical features via an ONNX-exported XGBoost classifier, achieving 0.029\,ms per-request latency (four orders of magnitude below typical generation time). Because admission scheduling depends on relative ordering rather than exact prediction, the system optimises ranking fidelity, achieving 62--96\% in-distribution and 52--66\% cross-distribution accuracy across natural conversation datasets. We find that curated instruction datasets are degenerate training sources for length prediction: GPT-imposed brevity constraints reduce Long-class representation to under 0.02\% of examples, making natural conversation logs the only viable training source. End-to-end GPU benchmarks on an RTX~4090 show 70--76\% P50 latency reduction for short requests under maximum queue pressure (100 concurrent requests) and 17\% under steady-state Poisson arrivals ($ρ=0.74$). \clairvoyant is open-source and requires no modifications to the inference backend.

View on arXiv PDF

Similar