DC AI | Jan 29

EWSJF: An Adaptive Scheduler with Hybrid Partitioning for Mixed-Workload LLM Inference

Bronislav Sidik, Chaya Levi, Joseph Kampeas

arXiv:2601.21758v13.32 citations | h-index: 4 | Has Code

Originality: Highly original

AI Analysis

The paper addresses a fundamental scheduling problem in efficient, responsive LLM serving and reports significant performance gains for systems handling mixed workloads.

The paper tackles the challenge of scheduling mixed workloads for LLM inference, where short interactive queries and long batch requests compete for the same hardware. It introduces EWSJF, an adaptive scheduler that improves end-to-end throughput by over 30% and reduces average Time-To-First-Token for short requests by up to 4x compared to standard FCFS policies.

Serving Large Language Models (LLMs) under mixed workloads--short, latency-sensitive interactive queries alongside long, throughput-oriented batch requests--poses a fundamental scheduling challenge. Standard First-Come, First-Served (FCFS) policies suffer from severe head-of-line blocking, leading to high tail latency and underutilized hardware. We introduce EWSJF (Effective Workload-based Shortest Job First), an adaptive request-level scheduler that learns workload structure in real time to jointly improve fairness and throughput. EWSJF operates upstream of execution-level schedulers and integrates four components: (1) Refine-and-Prune, an unsupervised partitioning algorithm that discovers performance-homogeneous request groups; (2) Dynamic Queue Routing for assigning requests to these groups; (3) Density-Weighted Scoring, a context-aware prioritization function balancing urgency and fairness; and (4) Bayesian Meta-Optimization, which continuously tunes scoring and partitioning parameters based on live performance feedback. Implemented in vLLM, EWSJF improves end-to-end throughput by over 30% and reduces average Time-To-First-Token for short requests by up to 4x compared to FCFS. These results demonstrate that adaptive, learning-based request scheduling is a critical missing layer for efficient and responsive LLM serving. Implementation available at https://anonymous.4open.science/r/vllm_0110-32D8.
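The head-of-line blocking described in the abstract can be illustrated with a toy single-server model. The sketch below is not the EWSJF implementation; it only compares plain FCFS with plain shortest-job-first ordering. The `Request` class, the service costs, and the `short_threshold` cutoff are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    name: str
    cost: float  # hypothetical service time in seconds (assumption, not from the paper)


def start_times(order):
    """Serve requests one at a time; return {name: time at which service begins}."""
    clock = 0.0
    starts = {}
    for req in order:
        starts[req.name] = clock
        clock += req.cost
    return starts


def mean_short_wait(order, short_threshold=1.0):
    """Average queueing delay of requests whose cost is at most short_threshold."""
    starts = start_times(order)
    waits = [starts[r.name] for r in order if r.cost <= short_threshold]
    return sum(waits) / len(waits)


# All requests are assumed to arrive at t=0 in this order:
# one long batch job followed by three short interactive queries.
arrivals = [
    Request("batch", 10.0),
    Request("q1", 0.5),
    Request("q2", 0.5),
    Request("q3", 0.5),
]

fcfs_order = list(arrivals)                         # first come, first served
sjf_order = sorted(arrivals, key=lambda r: r.cost)  # shortest job first (stable sort)

fcfs_wait = mean_short_wait(fcfs_order)
sjf_wait = mean_short_wait(sjf_order)

print(f"FCFS mean wait for short requests: {fcfs_wait:.2f}s")
print(f"SJF  mean wait for short requests: {sjf_wait:.2f}s")
```

In this toy setup the short queries wait behind the batch job under FCFS, while the total time to drain the queue is the same under both orderings. Pure shortest-job-first can starve long requests when short ones keep arriving, which is presumably why the abstract describes a scoring function that balances urgency against fairness rather than ordering by length alone.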
