CL · AI · May 11

SOMA: Efficient Multi-turn LLM Serving via Small Language Model

Xueqi Cheng, Qiong Wu, Zhengyi Zhou, Xugui Zhou, Tyler Derr, Yushun Dong

arXiv:2605.11317 · 91.1 · Has Code

Predicted impact: top 28% in CL · last 90 days · Originality: Incremental advance

AI Analysis

For LLM serving systems, SOMA offers an efficient way to handle multi-turn dialogues without sacrificing coherence.

SOMA reduces multi-turn LLM serving costs by using a small surrogate model fine-tuned on early-turn data, achieving up to 2x latency reduction with less than 5% quality degradation.

Large Language Models (LLMs) are increasingly deployed in multi-turn dialogue settings where preserving conversational context across turns is essential. A standard serving practice concatenates the full dialogue history at every turn, which reliably maintains coherence but incurs substantial cost in latency, memory, and API expenditure, especially when queries are routed to large proprietary models. Existing approaches often struggle to balance the trade-off between response quality and efficiency. We propose a framework that exploits the early turns of a session to estimate a local response manifold and then adapt a smaller surrogate model to this local region for the remainder of the conversation. Concretely, we learn soft prompts that maximize semantic divergence between the large and surrogate small language models' responses to surface least-aligned local directions, stabilize training with anti-degeneration control, and distill the mined cases into localized LoRA fine-tuning so the surrogate runs without prompts at inference. A simple gate enables a one-time switch with rollback on drift. We further provide a theoretical analysis for key components in SOMA. Extensive experiments show the effectiveness of SOMA. The source code is provided at: https://github.com/LabRAI/SOMA.
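The "one-time switch with rollback on drift" gate can be illustrated with a minimal sketch. This is not SOMA's implementation: the abstract does not say how drift is measured or when the switch fires, so the class name `SwitchGate`, the parameters `warmup_turns` and `drift_threshold`, and the scalar drift score (for example, one minus the cosine similarity between the current turn and the early-turn centroid) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SwitchGate:
    """Hypothetical routing gate: large model first, one switch, one rollback."""

    warmup_turns: int = 3          # assumed: early turns always go to the large model
    drift_threshold: float = 0.5   # assumed: drift above this counts as leaving the local region
    turns_seen: int = 0
    switched: bool = False
    rolled_back: bool = False

    def route(self, drift: float) -> str:
        """Return "large" or "small" for the current turn given a drift score."""
        self.turns_seen += 1
        if self.rolled_back:
            # After a rollback the session stays on the large model.
            return "large"
        if self.switched:
            if drift > self.drift_threshold:
                # The conversation drifted away: roll back.
                self.rolled_back = True
                return "large"
            return "small"
        if self.turns_seen > self.warmup_turns and drift <= self.drift_threshold:
            # Warm-up is over and the session looks stable: switch once.
            self.switched = True
            return "small"
        return "large"


gate = SwitchGate(warmup_turns=2, drift_threshold=0.5)
routes = [gate.route(d) for d in [0.1, 0.1, 0.1, 0.2, 0.9, 0.1]]
print(routes)  # ['large', 'large', 'small', 'small', 'large', 'large']
```

In this sketch the rollback is permanent for the session, which is one reading of "one-time switch"; a real system might instead re-adapt the surrogate after a rollback.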
