LG AI DC PF · Sep 25, 2025

Prompt-Aware Scheduling for Low-Latency LLM Serving

arXiv:2510.03243v2 · 2 citations · h-index: 5
Originality: Incremental advance
AI Analysis

This work addresses latency in LLM serving for users of systems such as vLLM. The contribution is incremental, since it builds on existing scheduling methods.

The paper tackles high latency in LLM serving by introducing PARS, a prompt-aware scheduler that reduces latency by approximating shortest-job-first scheduling. Experiments across multiple LLMs and real-world datasets show significant performance improvements.

Efficient scheduling of LLM inference tasks is essential for achieving low latency and high throughput, particularly with the growing use of reasoning-capable LLMs. Traditional strategies like First-Come-First-Serve (FCFS) often suffer from Head-of-Line (HOL) blocking, where long-running tasks delay shorter ones queued behind them. In this paper, we introduce PARS, a prompt-aware LLM task scheduler that improves serving efficiency by approximating shortest-job-first (SJF) scheduling through pairwise ranking with margin ranking loss. PARS focuses on impactful scheduling decisions and is seamlessly integrated into the state-of-the-art LLM serving system vLLM. It effectively predicts response-length-based task ordering, reducing latency with minimal overhead. Extensive experiments across multiple LLMs and real-world inference datasets show that PARS significantly improves performance, including for reasoning workloads. Furthermore, our cross-model evaluations demonstrate that the design generalizes well, enabling effective scheduling even when predictors are trained on different LLMs.
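To make the abstract's two core ideas concrete, the following minimal sketch shows (a) why head-of-line blocking under FCFS inflates waiting time compared with shortest-job-first ordering, and (b) the standard pairwise margin ranking loss, max(0, -y * (s_a - s_b) + margin), that a length-ranking predictor could be trained with. It is an illustration under stated assumptions, not the PARS implementation: the decode times, the predictor scores, the margin of 1.0, and the function names are all hypothetical, and the paper's actual predictor and vLLM integration are not reproduced here.

```python
from itertools import accumulate


def mean_waiting_time(service_times):
    """Mean time a job waits before starting when jobs run one at a time, in order."""
    # Each job starts when all jobs ahead of it have finished.
    starts = [0.0] + list(accumulate(service_times))[:-1]
    return sum(starts) / len(service_times)


def margin_ranking_loss(score_a, score_b, y, margin=1.0):
    """Pairwise margin ranking loss: max(0, -y * (score_a - score_b) + margin).

    y = +1 means item a should receive the higher score, y = -1 the lower one.
    """
    return max(0.0, -y * (score_a - score_b) + margin)


# Hypothetical queue: one long reasoning request arrives ahead of three short ones.
decode_times = [30.0, 1.0, 2.0, 1.0]

# FCFS serves in arrival order, so the long request blocks everything behind it.
fcfs_wait = mean_waiting_time(decode_times)

# Oracle SJF serves the shortest job first (needs true lengths, unknown in practice).
sjf_wait = mean_waiting_time(sorted(decode_times))

# Hypothetical predictor scores (higher = expected longer response). Only the
# relative order matters for scheduling, which is why ranking suffices.
scores = [2.1, -0.5, 0.3, -0.4]
order = sorted(range(len(scores)), key=lambda i: scores[i])
ranked_wait = mean_waiting_time([decode_times[i] for i in order])

# Training signal for one pair: request 0 really is longer than request 1 (y = +1).
# The score gap already exceeds the margin, so this pair contributes zero loss.
well_separated = margin_ranking_loss(scores[0], scores[1], y=1)

# Requests 2 and 3 are ordered correctly but by less than the margin,
# so the loss is still positive and pushes the scores further apart.
too_close = margin_ranking_loss(scores[2], scores[3], y=1)

print(f"FCFS mean wait:   {fcfs_wait}")
print(f"SJF mean wait:    {sjf_wait}")
print(f"Ranked mean wait: {ranked_wait}")
```

In this toy queue, the ranked order matches the oracle SJF order even though the scores are not length estimates. This reflects the abstract's framing that the scheduler needs the response-length-based ordering of tasks rather than exact lengths.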

