CL, AI, LG · Jul 3, 2025

SynapseRoute: An Auto-Route Switching Framework on Dual-State Large Language Model

Wencheng Zhang, Shiqin Qiao, Lingjie Luo, Yinfeng Li, Chuanyang Zheng, Qian Xu, Meng Li, Yong Gui, Yijun He, Jianing Qiu, Jindong Hong, Jiankai Sun

arXiv:2507.02822v1 · 9 citations · h-index: 8

Originality: Incremental advance

AI Analysis

This work addresses cost-efficiency for LLM users in applications such as medical question answering. It is an incremental advance: it builds on existing dual-state models and adds a novel routing method.

The paper tackles the problem of balancing performance and cost in large language models by proposing SynapseRoute, a dynamic routing framework that assigns each query to either the high-cost 'thinking' mode or the low-cost 'non-thinking' mode based on its complexity. On medical datasets, and compared to using the thinking mode alone, it improves accuracy (0.8390 vs. 0.8272), reduces inference time by 36.8%, and reduces token consumption by 39.66%.

With the widespread adoption of large language models (LLMs) in practical applications, selecting an appropriate model requires balancing not only performance but also operational cost. The emergence of reasoning-capable models has further widened the cost gap between "thinking" (high reasoning) and "non-thinking" (fast, low-cost) modes. In this work, we reveal that approximately 58% of medical questions can be accurately answered by the non-thinking mode alone, without requiring the high-cost reasoning process. This highlights a clear dichotomy in problem complexity and suggests that dynamically routing queries to the appropriate mode based on complexity could optimize accuracy, cost-efficiency, and overall user experience. Building on this finding, we propose SynapseRoute, a machine learning-based dynamic routing framework that assigns input queries to either thinking or non-thinking modes. Experimental results on several medical datasets demonstrate that SynapseRoute not only improves overall accuracy (0.8390 vs. 0.8272) compared to the thinking mode alone but also reduces inference time by 36.8% and token consumption by 39.66%. Importantly, qualitative analysis indicates that over-reasoning on simpler queries can lead to unnecessary delays and even decreased accuracy, a pitfall avoided by our adaptive routing. Finally, we introduce the Accuracy-Inference-Token (AIT) index to comprehensively evaluate the trade-offs among accuracy, latency, and token cost.
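
The routing idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract only says the router is machine learning-based, so the scoring heuristic, the cue words, the 0.5 threshold, and all names below (`toy_complexity_score`, `route`, `RouteDecision`) are hypothetical stand-ins for a trained classifier.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RouteDecision:
    mode: str     # "thinking" or "non-thinking"
    score: float  # estimated complexity in [0, 1]


def toy_complexity_score(query: str) -> float:
    """Hypothetical stand-in for a trained complexity classifier.

    Longer queries and queries containing reasoning cue phrases score higher.
    The cue list and weights are illustrative assumptions only.
    """
    cues = ("differential", "most likely", "next step", "mechanism", "contraindicated")
    text = query.lower()
    cue_hits = sum(cue in text for cue in cues)
    length_term = min(len(text.split()) / 100.0, 1.0)
    return min(0.5 * length_term + 0.25 * cue_hits, 1.0)


def route(
    query: str,
    scorer: Callable[[str], float] = toy_complexity_score,
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> RouteDecision:
    """Send queries judged complex to the thinking mode, the rest to non-thinking."""
    score = scorer(query)
    mode = "thinking" if score >= threshold else "non-thinking"
    return RouteDecision(mode=mode, score=score)


if __name__ == "__main__":
    simple = "What is the normal adult resting heart rate?"
    hard = (
        "A 64-year-old man has chest pain radiating to the left arm. "
        "What is the most likely diagnosis and the next step in management?"
    )
    for q in (simple, hard):
        print(route(q).mode, "<-", q)
```

In a real system the `scorer` argument would presumably be replaced by a model trained on labels indicating whether the non-thinking mode answered each question correctly; the sketch only shows where such a component would plug in.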
