NI · CL · Mar 26

Evaluating Small Language Models for Front-Door Routing: A Harmonized Benchmark and Synthetic-Traffic Experiment

arXiv:2604.02367 · 18.4 · h-index: 1
Predicted impact: top 19% in NI · last 90 days · Originality: Synthesis-oriented
AI Analysis

For practitioners deploying multi-model inference systems, this work provides an empirical assessment of SLM-based routing, showing it is cost-feasible but not yet accurate enough for production.

The paper evaluates whether small language models (SLMs, 1-4B parameters) can serve as low-cost, low-latency routers for selecting larger models at inference time. In a benchmark and synthetic-traffic experiment, Qwen-2.5-3B achieved 0.783 accuracy with sub-second latency and zero marginal cost, but no model met the viability criterion of >=0.85 accuracy and <=2000ms P95 latency, leaving a 6-8 percentage point accuracy gap.

Selecting the appropriate model at inference time, known as the routing problem, requires jointly optimizing output quality, cost, latency, and governance constraints. Existing approaches delegate this decision to LLM-based classifiers or preference-trained routers that are themselves costly and slow, and that reduce a multi-objective optimization to one-dimensional quality prediction. We argue that small language models (SLMs, 1-4B parameters) have now achieved sufficient reasoning capability for sub-second, zero-marginal-cost, self-hosted task classification, potentially making the routing decision negligible in the inference budget. We test this thesis on a six-label taxonomy through two studies. Study 1 is a harmonized offline benchmark of Phi-3.5-mini, Qwen-2.5-1.5B, and Qwen-2.5-3B, all run on identical Azure T4 hardware with the same serving stack, quantization, and fixed 60-case corpus. Qwen-2.5-3B achieves the best exact-match accuracy (0.783) and the strongest latency-accuracy tradeoff, and it is the only model with nonzero accuracy on all six task families. Study 2 is a pre-registered four-arm randomized experiment under synthetic traffic, with an effective sample size of 60 unique cases per arm, comparing Phi-4-mini, Qwen-2.5-3B, and DeepSeek-V3 against a no-routing control. DeepSeek-V3 attains the highest accuracy (0.830) but fails the pre-registered P95 latency gate (2,295 ms); Qwen-2.5-3B is Pareto-dominant among self-hosted models (0.793 accuracy, 988 ms median latency, $0 marginal cost). No model meets the standalone viability criterion (>=0.85 accuracy, <=2,000 ms P95). The cost and latency prerequisites for SLM-based routing are therefore met. Two things bound the remaining distance to production viability: the accuracy gap of 6-8 percentage points, and the untested question of whether correct classification translates to downstream output quality.
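The viability criterion in the abstract can be read as a two-part gate on accuracy and P95 latency. The following is a minimal sketch of such a gate, not the authors' evaluation code. The names `RouterResult`, `gate`, `is_viable`, and `p95_nearest_rank` are hypothetical, and the nearest-rank percentile is an assumption, since the abstract does not say how P95 was computed. Only figures stated in the abstract are used; the P95 latency of Qwen-2.5-3B is not reported there, so it is left as `None`.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

MIN_ACCURACY = 0.85    # accuracy threshold stated in the abstract
MAX_P95_MS = 2000.0    # P95 latency threshold stated in the abstract


def p95_nearest_rank(latencies_ms: List[float]) -> float:
    """95th percentile by the nearest-rank method (an assumed definition)."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Integer form of ceil(0.95 * n), avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


@dataclass
class RouterResult:
    name: str
    accuracy: float
    p95_ms: Optional[float]  # None when the P95 latency is not reported


def gate(result: RouterResult) -> Tuple[bool, Optional[bool]]:
    """Return (accuracy_ok, latency_ok); latency_ok is None if P95 is unknown."""
    accuracy_ok = result.accuracy >= MIN_ACCURACY
    latency_ok = None if result.p95_ms is None else result.p95_ms <= MAX_P95_MS
    return accuracy_ok, latency_ok


def is_viable(result: RouterResult) -> bool:
    """A router is viable only if both checks are known to pass."""
    accuracy_ok, latency_ok = gate(result)
    return accuracy_ok and latency_ok is True


if __name__ == "__main__":
    # Study 2 figures taken from the abstract.
    results = [
        RouterResult("DeepSeek-V3", accuracy=0.830, p95_ms=2295.0),
        RouterResult("Qwen-2.5-3B", accuracy=0.793, p95_ms=None),
    ]
    for r in results:
        print(r.name, gate(r), is_viable(r))
```

Treating an unreported P95 as "not viable" rather than "viable" is a conservative design choice in this sketch; with the reported accuracies, both models fail the gate on accuracy alone.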
