NI · CL · Mar 26

Evaluating Small Language Models for Front-Door Routing: A Harmonized Benchmark and Synthetic-Traffic Experiment

arXiv:2604.02367 · 18.4 · h-index: 1

Predicted impact: top 19% in NI · last 90 days · Originality: Synthesis-oriented

AI Analysis

For practitioners deploying multi-model inference systems, this work provides an empirical assessment of SLM-based routing, showing it is cost-feasible but not yet accurate enough for production.

The paper evaluates whether small language models (SLMs, 1-4B parameters) can serve as low-cost, low-latency routers for selecting larger models at inference time. Qwen-2.5-3B was the strongest self-hosted router: it reached 0.783 exact-match accuracy in the offline benchmark and 0.793 accuracy at 988 ms median latency under synthetic traffic, at zero marginal cost. No model met the viability criterion of >=0.85 accuracy and <=2,000 ms P95 latency, leaving a 6-8 percentage point accuracy gap.

Selecting the appropriate model at inference time -- the routing problem -- requires jointly optimizing output quality, cost, latency, and governance constraints. Existing approaches delegate this decision to LLM-based classifiers or preference-trained routers that are themselves costly and high-latency, reducing a multi-objective optimization to single-dimensional quality prediction. We argue that small language models (SLMs, 1-4B parameters) have now achieved sufficient reasoning capability for sub-second, zero-marginal-cost, self-hosted task classification, potentially making the routing decision negligible in the inference budget. We test this thesis on a six-label taxonomy through two studies. Study 1 is a harmonized offline benchmark of Phi-3.5-mini, Qwen2.5-1.5B, and Qwen-2.5-3B on identical Azure T4 hardware, serving stack, quantization, and a fixed 60-case corpus. Qwen-2.5-3B achieves the best exact-match accuracy (0.783), the strongest latency-accuracy tradeoff, and the only nonzero accuracy on all six task families. Study 2 is a pre-registered four-arm randomized experiment under synthetic traffic with an effective sample size of 60 unique cases per arm, comparing Phi-4-mini, Qwen-2.5-3B, and DeepSeek-V3 against a no-routing control. DeepSeek-V3 attains the highest accuracy (0.830) but fails the pre-registered P95 latency gate (2,295 ms); Qwen-2.5-3B is Pareto-dominant among self-hosted models (0.793 accuracy, 988 ms median, $0 marginal cost). No model meets the standalone viability criterion (>=0.85 accuracy, <=2,000 ms P95). The cost and latency prerequisites for SLM-based routing are met; the accuracy gap of 6-8 percentage points and the untested question of whether correct classification translates to downstream output quality bound the remaining distance to production viability.
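The viability criterion in the abstract is a two-part gate, and a minimal sketch of it helps make the reported numbers concrete. The thresholds and the accuracy and P95 figures below come from the abstract; the names `RouterResult`, `is_viable`, and `accuracy_gap_pp` are hypothetical and are not taken from the paper. Treating an unreported P95 as a failed latency check is an assumption made here for illustration only.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds as stated in the abstract.
ACCURACY_FLOOR = 0.85     # >= 0.85 accuracy
P95_CEILING_MS = 2000.0   # <= 2,000 ms P95 latency


@dataclass(frozen=True)
class RouterResult:
    """Hypothetical container for one router's reported results."""
    name: str
    accuracy: float
    p95_ms: Optional[float]  # None where the abstract reports no P95


def accuracy_gap_pp(result: RouterResult) -> float:
    """Shortfall to the accuracy floor, in percentage points (0 if met)."""
    return max(0.0, (ACCURACY_FLOOR - result.accuracy) * 100)


def is_viable(result: RouterResult) -> bool:
    """Apply both gates; an unknown P95 is conservatively treated as a fail."""
    accuracy_ok = result.accuracy >= ACCURACY_FLOOR
    latency_ok = result.p95_ms is not None and result.p95_ms <= P95_CEILING_MS
    return accuracy_ok and latency_ok


# Figures reported in the abstract; P95 for Qwen-2.5-3B is not stated there.
qwen_study1 = RouterResult("Qwen-2.5-3B (Study 1)", 0.783, None)
deepseek = RouterResult("DeepSeek-V3 (Study 2)", 0.830, 2295.0)

for r in (qwen_study1, deepseek):
    print(f"{r.name}: viable={is_viable(r)}, gap={accuracy_gap_pp(r):.1f} pp")
```

Under these inputs, DeepSeek-V3 fails both the accuracy gate and the latency gate, and Qwen-2.5-3B fails on accuracy regardless of its P95.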
