DC, LG · Oct 15, 2025

Adaptive Rescheduling in Prefill-Decode Disaggregated LLM Inference

arXiv:2510.13668v1 · 1 citation · h-index: 4
Originality: Incremental advance
AI Analysis

This paper addresses performance issues in LLM inference systems for real-world applications, offering an incremental improvement over existing disaggregated architectures.

The paper tackles the problem of workload imbalance in LLM inference caused by variable output lengths, proposing ARES, an adaptive rescheduling system that reduces P99 TPOT by 74.77% and increases goodput by up to 2.24 times.

Large Language Model (LLM) inference has emerged as a fundamental paradigm. In real-world scenarios, variations in output length cause severe workload imbalance in the decode phase, particularly for long-output reasoning tasks. Existing systems, such as prefill-decode (PD) disaggregation architectures, rely on static prefill-to-decode scheduling, which often results in SLO violations and OOM failures under evolving decode workloads. In this paper, we propose ARES, an adaptive decoding rescheduling system powered by length prediction to anticipate future workloads. Our core contributions include: (1) a lightweight and continuous LLM-native prediction method that leverages the LLM's hidden states to model remaining generation length with high precision (reducing MAE by 49.42%) and low overhead (cutting predictor parameters by 93.28%); (2) a rescheduling solution in the decode phase with a dynamic balancing mechanism that integrates current and predicted workloads, reducing P99 TPOT by 74.77% and achieving up to 2.24 times higher goodput.
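
To make the idea of balancing on both current and predicted workload concrete, here is a minimal sketch. It is not the ARES algorithm, which the abstract does not specify. It assumes a hypothetical per-request prediction of remaining tokens (`predicted_remaining`), a hypothetical weighting factor `alpha` that blends current and predicted load, and it ignores the cost of migrating KV cache between decode instances.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    req_id: str
    generated_tokens: int      # tokens already held in the KV cache
    predicted_remaining: int   # hypothetical predictor output: tokens still to come


def cost(req: Request, alpha: float) -> float:
    """Blend current and predicted load for one request (alpha is an assumption)."""
    return (1 - alpha) * req.generated_tokens + alpha * req.predicted_remaining


@dataclass
class DecodeInstance:
    name: str
    requests: list = field(default_factory=list)

    def load(self, alpha: float) -> float:
        return sum(cost(r, alpha) for r in self.requests)


def rebalance_once(instances, alpha=0.5):
    """Move one request from the most loaded to the least loaded instance,
    but only if the move narrows the load gap. Returns the moved req_id or None."""
    src = max(instances, key=lambda i: i.load(alpha))
    dst = min(instances, key=lambda i: i.load(alpha))
    gap = src.load(alpha) - dst.load(alpha)
    if src is dst or gap <= 0:
        return None
    # Moving a request of cost c changes the gap to |gap - 2c|.
    best = min(src.requests, key=lambda r: abs(gap - 2 * cost(r, alpha)))
    if abs(gap - 2 * cost(best, alpha)) >= gap:
        return None  # no single move improves the balance
    src.requests.remove(best)
    dst.requests.append(best)
    return best.req_id
```

With `alpha = 0`, the sketch balances only on tokens generated so far, which loosely corresponds to a scheduler with no view of future work. Raising `alpha` lets a request that is short now but predicted to run long count as heavy before it actually becomes so.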
