ET | AI | CL | SY | Oct 9, 2025

When to Reason: Semantic Router for vLLM

Chen Wang, Xunzhuo Liu, Yuhan Liu, Yue Zhu, Xiangxi Mo, Junchen Jiang, Huamin Chen

arXiv:2510.08731v1 | 8.08 citations | h-index: 7 | Has Code

Originality: Incremental advance

AI Analysis

The paper addresses efficiency and cost issues in open-source LLM serving systems by reducing unnecessary reasoning on simple prompts. The advance is incremental because it builds on existing reasoning methods.

The paper tackles the problem of unnecessary reasoning costs in LLMs by introducing a semantic router that selectively applies reasoning only when beneficial, achieving a 10.2 percentage point accuracy improvement on MMLU-Pro while reducing latency by 47.1% and token consumption by 48.5% compared to direct inference with vLLM.

Large Language Models (LLMs) demonstrate substantial accuracy gains when augmented with reasoning modes such as chain-of-thought and inference-time scaling. However, reasoning also incurs significant costs in inference latency and token usage, with environmental and financial impacts, which are unnecessary for many simple prompts. We present a semantic router that classifies queries based on their reasoning requirements and selectively applies reasoning only when beneficial. Our approach achieves a 10.2 percentage point improvement in accuracy on the MMLU-Pro benchmark while reducing response latency by 47.1% and token consumption by 48.5% compared to direct inference with vLLM. These results demonstrate that semantic routing offers an effective mechanism for striking a balance between accuracy and efficiency in open-source LLM serving systems.
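The routing idea in the abstract (classify a query, then enable reasoning only when it is likely to help) can be illustrated with a minimal sketch. This is not the paper's implementation: the keyword lists, the category names, the set `REASONING_CATEGORIES`, and the functions `classify` and `route` are all hypothetical stand-ins for a learned semantic classifier, chosen only to make the control flow concrete.

```python
from dataclasses import dataclass

# Hypothetical: categories assumed to benefit from step-by-step reasoning.
REASONING_CATEGORIES = {"math", "physics", "chemistry"}

# Hypothetical keyword lists; a real router would use a learned classifier.
CATEGORY_KEYWORDS = {
    "math": ("integral", "prove", "equation", "solve"),
    "physics": ("velocity", "force", "momentum"),
    "chemistry": ("molar", "reaction", "ph of"),
    "history": ("war", "treaty", "century"),
}


@dataclass
class RoutingDecision:
    category: str
    use_reasoning: bool


def classify(prompt: str) -> str:
    """Return the category with the most keyword hits, or 'other' if none match."""
    text = prompt.lower()
    scores = {
        category: sum(keyword in text for keyword in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"


def route(prompt: str) -> RoutingDecision:
    """Decide whether the prompt should be served with reasoning enabled."""
    category = classify(prompt)
    return RoutingDecision(category, category in REASONING_CATEGORIES)


if __name__ == "__main__":
    for prompt in ("Solve the equation x^2 = 4", "Which treaty ended the war?"):
        print(prompt, "->", route(prompt))
```

In a serving setup, the caller would presumably use `use_reasoning` to choose between a reasoning-enabled request and a direct one, so that simple prompts skip the extra latency and tokens.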
