DC · Apr 17

Accuracy Is Speed: Towards Long-Context-Aware Routing for Distributed LLM Serving

Takeshi Yoshimura, Valentijn Dymphnus van de Beek, Tatsuhiro Chiba

arXiv:2604.15732 · 58.4 · h-index: 9

Predicted impact: top 22% in DC · last 90 days
Originality: Incremental advance

AI Analysis

For distributed LLM serving systems, this work highlights that accuracy should be a first-class systems objective under long-context workloads, addressing a previously overlooked latency-accuracy tradeoff.

The paper shows that under long-context workloads, variance in inference accuracy causes retries, which makes accuracy a direct factor in user-visible latency. It proposes a new metric, Time-to-Correct-Answer (TTCA), and a routing method, Lightweight Accuracy-Aware Routing (LAAR), to reduce the time to a correct answer.

Distributed LLM serving systems optimize per-request latency and throughput. However, under long-context workloads, inference accuracy becomes more variable. When incorrect responses trigger retries, accuracy directly translates into cumulative user-visible delay that is not captured by single-shot latency metrics. In this work, we argue that under long-context serving, accuracy becomes speed through retry dynamics. We introduce Time-to-Correct-Answer (TTCA), a metric that measures the wall-clock time required to obtain the first correct response. Our measurement study shows that prompt characteristics such as length and language amplify accuracy variance, which inflates TTCA. We demonstrate Lightweight Accuracy-Aware Routing (LAAR), a capability-based routing design that reduces TTCA. Our results suggest that in long-context distributed serving, accuracy should be treated as a first-class systems objective.
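
To make the retry argument concrete, here is a minimal Python sketch of how TTCA could be computed from a log of attempts, and how an expected TTCA might be used to choose between backends. It is an illustration only, not the paper's LAAR design. The names `Attempt`, `ttca`, `expected_ttca`, and `pick_backend` are hypothetical, and the model assumes retries run sequentially, are independent, and have a fixed per-attempt latency and success probability, so that expected TTCA is latency divided by the probability of a correct answer.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Attempt:
    latency_s: float  # wall-clock duration of this attempt, in seconds
    correct: bool     # whether the response was judged correct


def ttca(attempts: Sequence[Attempt]) -> Optional[float]:
    """Total time up to and including the first correct attempt.

    Returns None when no attempt was correct (assumed convention).
    """
    elapsed = 0.0
    for attempt in attempts:
        elapsed += attempt.latency_s
        if attempt.correct:
            return elapsed
    return None


def expected_ttca(latency_s: float, p_correct: float) -> float:
    """Expected TTCA under independent retries (geometric model, assumed)."""
    if not 0.0 < p_correct <= 1.0:
        raise ValueError("p_correct must be in (0, 1]")
    # The mean number of attempts until first success is 1 / p_correct.
    return latency_s / p_correct


def pick_backend(candidates: Dict[str, Tuple[float, float]]) -> str:
    """Pick the backend with the lowest expected TTCA.

    candidates maps a backend name to (latency_s, p_correct).
    """
    return min(candidates, key=lambda name: expected_ttca(*candidates[name]))
```

Under these assumptions, a backend with 2 s latency and 40% accuracy has an expected TTCA of 5 s, while one with 3 s latency and 90% accuracy has about 3.3 s, so the slower but more accurate backend would be preferred. The figures are invented for illustration and are not results from the paper.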
