Is Escalation Worth It? A Decision-Theoretic Characterization of LLM Cascades

arXiv:2605.06350

AI Analysis

For practitioners deploying LLM cascades, the paper offers theory and empirical evidence that cascade performance is limited mainly by structural cost (the cheap model is always paid for before any escalation decision) rather than by a shortage of intermediate stages. This can guide more efficient cost-quality tradeoffs.

The paper develops a decision-theoretic framework for LLM cascades and characterizes the cost-quality frontier. Within deterministic threshold cascades, full fixed chains underperform the envelope of pairwise cascades, and optimized subsequence cascades yield no practically meaningful held-out gains over it. On four of the five benchmarks, a lightweight pre-generation router outperforms the best cascade policy, primarily by avoiding the cheap model's generation cost on queries sent directly to a larger model.
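The structural-cost argument can be illustrated with a small cost-accounting sketch. This is not the paper's code: the function names, the per-query costs, and the assumption that the router itself is free by default (`c_router=0.0`) are all hypothetical and chosen only for illustration. Under these assumptions, when both policies send the same fraction `p` of queries to the large model, the cascade costs `p * c_small` more per query, because it always pays for the cheap model first.

```python
def cascade_cost(c_small: float, c_large: float, p_escalate: float) -> float:
    """Expected per-query cost of a two-model cascade.

    Every query runs the cheap model; a fraction p_escalate also runs the large one.
    """
    return c_small + p_escalate * c_large


def router_cost(c_small: float, c_large: float, p_large: float,
                c_router: float = 0.0) -> float:
    """Expected per-query cost of a pre-generation router.

    Each query runs exactly one model, chosen before any generation.
    c_router is a hypothetical per-query routing overhead (assumed zero by default).
    """
    return c_router + (1.0 - p_large) * c_small + p_large * c_large


def structural_gap(c_small: float, c_large: float, p: float) -> float:
    """Extra cost the cascade pays at an equal escalation rate p (equals p * c_small)."""
    return cascade_cost(c_small, c_large, p) - router_cost(c_small, c_large, p)
```

For example, with hypothetical costs `c_small=1.0`, `c_large=8.0` and `p=0.25`, the cascade costs 3.0 per query and the router 2.75. The gap grows with the escalation rate, so it matters most on workloads where many queries need the larger model.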

Model cascades, in which a cheap LLM defers to an expensive one on low-confidence queries, are widely used to navigate the cost-quality tradeoff at deployment. Existing approaches largely treat the deferral threshold as an empirical hyperparameter, with limited guidance on the geometry of the resulting cost-quality frontier over a model pool. We develop a decision-theoretic framework grounded in constrained optimization and duality. For a two-model cascade, we establish piecewise concavity of the cost-quality frontier on decreasing-benefit regions of the confidence support, with reciprocal shadow prices linking the budget- and quality-constrained formulations. Given a pool of $k$ models, we characterize the frontier achievable by deterministic two-model threshold cascades as the pointwise envelope over $\binom{k}{2}$ pairwise cascades, with switching points where the optimal pair changes. For $k$-model cascades, we derive first-order conditions in which a single shadow price equalizes marginal quality-per-cost across stage boundaries. We validate the framework on five benchmarks (MATH, MMLU, TriviaQA, SimpleQA, LiveCodeBench) across eight models from five providers. Within the deterministic threshold-cascade class, full fixed chains underperform the pairwise envelope, and optimized subsequence cascades do not deliver practically meaningful held-out gains over it. A lightweight pre-generation router exceeds the best cascade policy on four of five datasets, mainly because it avoids the cheap model's generation cost on queries sent directly to a larger model rather than because of a stronger routing signal. These results suggest that cascade performance is limited primarily by structural cost, since cascades pay the cheap model before any escalation decision, rather than by a shortage of intermediate stages.
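As a rough illustration of the pairwise envelope described in the abstract, the sketch below sweeps the deferral threshold of a two-model cascade and then, for a given budget, picks the best pair from a pool. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the data layout (`cost`, `conf`, `correct` per model), and the rule "escalate when confidence is strictly below the threshold" are hypothetical choices, and quality is taken to be mean per-query correctness on a fixed evaluation set.

```python
from itertools import combinations


def cascade_frontier(conf, q_small, q_large, c_small, c_large):
    """Return (cost, quality) for every deferral threshold of a two-model cascade.

    conf[i] is the cheap model's confidence on query i; q_small[i] and q_large[i]
    are per-query quality scores. A query escalates when conf[i] < threshold.
    """
    n = len(conf)
    points = []
    # The smallest confidence as threshold escalates nothing; infinity escalates all.
    for t in sorted(set(conf)) + [float("inf")]:
        escalate = [c < t for c in conf]
        # The cheap model is always paid for; the large one only on escalation.
        cost = c_small + (sum(escalate) / n) * c_large
        quality = sum(ql if e else qs
                      for e, qs, ql in zip(escalate, q_small, q_large)) / n
        points.append((cost, quality))
    return points


def best_pair_at_budget(models, budget):
    """Pick the pairwise cascade with the highest quality at cost <= budget.

    models maps a name to {"cost": float, "conf": [...], "correct": [...]}.
    Returns ((cheap_name, large_name), quality), or (None, -inf) if nothing fits.
    """
    best = (None, float("-inf"))
    for a, b in combinations(models, 2):
        # The cheaper model of each pair acts as the first stage.
        small, large = sorted((a, b), key=lambda m: models[m]["cost"])
        points = cascade_frontier(
            models[small]["conf"], models[small]["correct"],
            models[large]["correct"], models[small]["cost"], models[large]["cost"],
        )
        feasible = [q for c, q in points if c <= budget]
        if feasible and max(feasible) > best[1]:
            best = ((small, large), max(feasible))
    return best
```

Calling `best_pair_at_budget` over a grid of budgets traces the envelope, and the budgets at which the returned pair changes correspond to the switching points mentioned above. The sketch covers only deterministic thresholds; it does not randomize between operating points.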
