QUIVER: A Formal Framework for Quantifying Perturbation Propagation and Bifurcation in Compound AI Systems

arXiv:2605.2395677.6

Predicted impact: top 39% in AI (last 90 days) · Originality: highly original

AI Analysis

For developers and researchers of compound AI systems, QUIVER addresses the lack of tools to measure how perturbations propagate through stochastic, graph-structured LLM pipelines, enabling better debugging and robustness analysis.

QUIVER provides a formal framework to quantify perturbation propagation in compound AI systems with graph-structured LLM pipelines, revealing distinct sensitivity profiles and cascade patterns across architectures. Validated on three pipelines with 8,200+ traces, it predicts bifurcation-prone nodes and localizes stale evaluation artifacts that aggregate metrics miss.

Compound AI systems that chain multiple LLM calls into directed computation graphs are now the dominant architecture for production AI. Although these architectures leverage heterogeneous nodes with mixed-mode outputs, no existing framework quantifies how perturbations propagate through such pipelines, where nodes are stochastic and execution paths can diverge structurally. We introduce QUIVER, a formal framework for measuring perturbation propagation in graph-structured LLM pipelines. The framework defines: (1) a sensitivity matrix with type-dispatched distance metrics that classifies edges as amplifiers, absorbers, or threshold-sensitive, complemented by occurrence-lift; (2) trajectory divergence, which decomposes variation into value drift, structural path divergence, and iteration count divergence; (3) bifurcation thresholds, which identify the smallest perturbation that causes a structural change in the execution path; and (4) distribution faithfulness, which quantifies when per-node evaluation datasets diverge from production distributions. We validate on three structurally distinct architectures: two production enterprise pipelines and a public DSPy multihop QA pipeline. Across 8,200+ instrumented traces (32,000+ pair comparisons), we demonstrate that QUIVER reveals distinct sensitivity profiles across architectures, distinguishes mechanistically different cascade patterns that produce identical divergence rates, predicts nodes prone to trajectory bifurcation from observational data alone, and localizes stale evaluation artifacts to specific node-field categories that aggregate metrics cannot surface.
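The abstract does not give QUIVER's formal definitions, so the following is only a minimal illustrative sketch of components (2) and (3). It assumes a trace can be reduced to an ordered list of visited node names plus one text output per node. The names `Trace`, `token_distance`, `decompose`, and `bifurcation_threshold` are hypothetical, and the token-level Jaccard distance is a stand-in for the paper's type-dispatched metrics, not the metric the authors use.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Trace:
    """Hypothetical trace record: visited nodes in order, last output per node."""
    path: list
    outputs: dict


def token_distance(a: str, b: str) -> float:
    """Jaccard distance over whitespace tokens (assumed stand-in metric)."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def decompose(base: Trace, pert: Trace) -> dict:
    """Split the difference between two traces into three parts."""
    cb, cp = Counter(base.path), Counter(pert.path)
    # Structural path divergence: the sets of visited nodes differ.
    structural = set(cb) != set(cp)
    # Iteration count divergence: shared nodes visited a different number of times.
    iteration_delta = sum(abs(cb[n] - cp[n]) for n in set(cb) & set(cp))
    # Value drift: mean output distance over nodes present in both traces.
    shared = set(base.outputs) & set(pert.outputs)
    drift = (
        sum(token_distance(base.outputs[n], pert.outputs[n]) for n in shared) / len(shared)
        if shared
        else 0.0
    )
    return {
        "value_drift": drift,
        "structural": structural,
        "iteration_delta": iteration_delta,
    }


def bifurcation_threshold(observations):
    """Smallest observed perturbation magnitude with a structural change, else None.

    `observations` is an iterable of (magnitude, structural_flag) pairs.
    """
    hits = [magnitude for magnitude, structural in observations if structural]
    return min(hits) if hits else None


base = Trace(["retrieve", "generate"], {"retrieve": "a b c", "generate": "x y"})
looped = Trace(["retrieve", "retrieve", "generate"], {"retrieve": "a b d", "generate": "x y"})
rerouted = Trace(["retrieve", "fallback", "generate"], {"retrieve": "a b c", "generate": "x y"})

loop_result = decompose(base, looped)
reroute_result = decompose(base, rerouted)
threshold = bifurcation_threshold([(0.1, False), (0.3, True), (0.5, True)])
```

In this toy example, `looped` repeats a node without changing which nodes run, so it registers as iteration count divergence plus some value drift, while `rerouted` visits a node the baseline never reaches and registers as structural divergence. Separating the three components in this way is what would let two pipelines with the same overall divergence rate be told apart by mechanism.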
