LG · CL · May 11

UCCI: Calibrated Uncertainty for Cost-Optimal LLM Cascade Routing

arXiv:2605.18796 · 26.3

Predicted impact: top 77% in LG · last 90 days
Originality: Incremental advance

AI Analysis

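For practitioners deploying LLM cascades, UCCI offers a principled routing method that replaces per-workload threshold tuning with constrained cost minimization over calibrated error probabilities. On the reported NER workload it cuts inference cost by 31%, and its threshold policy is cost-optimal under the paper's three stated assumptions.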

UCCI introduces a calibration-first router for LLM cascades that uses isotonic regression to map token-level margin uncertainty to error probabilities and selects escalation thresholds via constrained cost minimization. On a production NER workload with 75k queries, it reduces inference cost by 31% at micro-F1=0.91 while cutting ECE from 0.12 to 0.03.

LLM cascades and model routing promise lower inference cost by sending easy queries to a small model and escalating hard ones to a large model, but most deployed routers use uncalibrated confidence scores and require per-workload threshold tuning. We present UCCI, a calibration-first router that maps token-level margin uncertainty to a per-query error probability via isotonic regression and selects the escalation threshold by constrained cost minimization. Under three explicit assumptions, threshold policies on the calibrated score are cost-optimal, and isotonic calibration achieves O(n^{-1/3}) sample complexity for expected calibration error (ECE). On a production named entity recognition workload of 75,000 queries served by 4B and 12B instruction-tuned LLMs on H100 GPUs, UCCI cuts inference cost by 31% (95% CI: [27%, 35%]) at micro-F1 = 0.91 while reducing ECE from 0.12 to 0.03. At the same operating point, UCCI beats entropy thresholding, split-conformal routing, and a FrugalGPT-style learned threshold. All cascade results use end-to-end routing on actual model outputs and measured H100 latency, not simulated routing from global accuracies or nominal API prices.
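The two steps named in the abstract, isotonic calibration of an uncertainty score and threshold selection under a constraint, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the toy data, and the specific form of the constraint (an upper bound on expected cascade error, with the large model's error rate supplied as `large_err`) are all assumptions made for illustration. The score is assumed to be oriented so that higher means more uncertain, and calibration scores are assumed to be distinct. The hand-rolled pool-adjacent-violators fit keeps the example dependency-free; in practice a library routine such as scikit-learn's `IsotonicRegression` would likely take its place.

```python
def fit_isotonic(scores, errors):
    """Pool-adjacent-violators fit of a non-decreasing error rate vs. score.

    Returns a step function as a list of (upper_score_edge, error_prob).
    """
    blocks = []  # each block holds [error_sum, count, upper_score_edge]
    for s, e in sorted(zip(scores, errors)):
        blocks.append([float(e), 1, s])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            err_sum, count, upper = blocks.pop()
            blocks[-1][0] += err_sum
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, err_sum / count) for err_sum, count, upper in blocks]


def predict(model, score):
    """Calibrated error probability for one uncertainty score."""
    for upper, p in model:
        if score <= upper:
            return p
    return model[-1][1]  # scores above the calibration range get the last step


def choose_threshold(p_errs, max_error, large_err=0.0):
    """Largest threshold t whose policy 'escalate iff p_err > t' meets the budget.

    Cost grows with the number of escalations, so scanning thresholds from the
    most permissive downward returns the cheapest feasible policy first.
    Returns None when no candidate threshold satisfies the error budget.
    """
    n = len(p_errs)
    for t in sorted(set(p_errs), reverse=True) + [-1.0]:  # -1.0 escalates all
        kept = [p for p in p_errs if p <= t]
        expected_error = (sum(kept) + (n - len(kept)) * large_err) / n
        if expected_error <= max_error:
            return t
    return None


# Toy calibration set (hypothetical): uncertainty scores and small-model errors.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
errors = [0, 0, 1, 0, 0, 1, 1, 1]

model = fit_isotonic(scores, errors)
p_errs = [predict(model, s) for s in scores]
threshold = choose_threshold(p_errs, max_error=0.15)
escalate = [p > threshold for p in p_errs]
```

On this toy set the three most uncertain queries are escalated and the remaining five stay on the small model. A real deployment would fit the calibrator on held-out data and measure per-query cost rather than count escalations.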
