LG May 25

Stateful Inference for Low-Latency Multi-Agent Tool Calling

arXiv:2605.26289 30.5

AI Analysis

For developers deploying multi-agent LLM systems, this work reduces latency by eliminating redundant computation, addressing a critical bottleneck in interactive agentic workflows.

Existing LLM inference frameworks reprocess the entire conversation for each tool call, even though 85-95% of the prompt tokens are unchanged from the previous turn. The authors propose a stateful inference architecture that keeps a persistent KV cache across turns and ingests only the new tokens, achieving a 2.1x per-turn speedup on a 6-turn workflow and 4.2x on the median turn of a 35-turn workflow, halving end-to-end wall time.
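To make the redundancy concrete, here is a minimal back-of-the-envelope sketch that counts how many prompt tokens are prefilled over a conversation with and without cross-turn reuse. The function name `prefill_tokens` and the token counts are hypothetical, chosen only for illustration; they are not taken from the paper, and the model ignores decode cost and generated tokens.

```python
def prefill_tokens(turn_deltas, stateful):
    """Total prompt tokens prefilled across a conversation.

    turn_deltas[i] is the number of new tokens appended at turn i
    (hypothetical input, not measured data).
    """
    total = 0
    context = 0
    for delta in turn_deltas:
        context += delta
        # Stateless serving re-reads the whole context each turn;
        # stateful serving reads only the newly appended tokens.
        total += delta if stateful else context
    return total


# Hypothetical workload: a 1000-token initial prompt, then 5 turns of 100 new tokens.
deltas = [1000] + [100] * 5
stateless_total = prefill_tokens(deltas, stateful=False)
stateful_total = prefill_tokens(deltas, stateful=True)
unchanged_fraction_last_turn = (sum(deltas) - deltas[-1]) / sum(deltas)
```

Under these assumed numbers the stateless path prefills 7500 tokens against 1500 for the stateful path, and about 93% of the final turn's prompt is unchanged. Prefill token counts are not wall time, so this ratio should not be read as a predicted speedup.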

Multi-agent tool calling is becoming the dominant interaction pattern for LLM-based systems, yet existing inference frameworks treat each tool call as an independent request, re-processing the entire conversation from scratch even though 85-95% of the prompt is unchanged from the previous turn. We present a stateful inference architecture that converts the $O(n_t)$ per-turn cost of conventional serving into an $O(\Delta_t)$ delta-only cost: a persistent KV cache lives across turns and advances by ingesting only the new tokens, while a radix prefix cache extends this across interleaved multi-agent traffic and a prompt-lookup speculative decoder accelerates structured output. Against vLLM and SGLang on novel, fully-generated workloads, the reference implementation is $2.1\times$ faster per turn on a 6-turn agentic workflow and $4.2\times$ on the median turn of a 35-turn one, halving end-to-end wall time. The advantage comes from stateful reuse and speculation, not caching.
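The delta-only idea can be sketched with a toy session object. This is an illustration under stated assumptions, not the paper's reference implementation: `StatefulSession` and `advance` are hypothetical names, and the "cache" here only records which token IDs have been ingested instead of holding real key/value tensors.

```python
from dataclasses import dataclass, field


@dataclass
class StatefulSession:
    """Toy stand-in for a per-conversation KV cache (tracks token IDs only)."""
    cached: list = field(default_factory=list)
    prefilled: int = 0  # running count of tokens actually processed

    def advance(self, prompt):
        """Ingest only the part of `prompt` not already cached; return its length."""
        # Longest common prefix between the cached tokens and the new prompt.
        keep = 0
        for a, b in zip(self.cached, prompt):
            if a != b:
                break
            keep += 1
        delta = list(prompt[keep:])
        # Drop cached tokens past the divergence point, then append the delta.
        self.cached = self.cached[:keep] + delta
        self.prefilled += len(delta)
        return len(delta)


session = StatefulSession()
turn1 = [1, 2, 3, 4]          # hypothetical token IDs
turn2 = turn1 + [5, 6]        # next turn appends a tool result
first = session.advance(turn1)
second = session.advance(turn2)
```

Here the first call processes all 4 tokens and the second processes only the 2 appended ones, so the per-turn work tracks $\Delta_t$ and not $n_t$. A real server would also need to handle eviction and sharing across agents, which the abstract attributes to the radix prefix cache and which this sketch omits.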
