LG, Apr 14

The Verification Tax: Fundamental Limits of AI Auditing in the Rare-Error Regime

arXiv:2604.12951 | 34.8 | h-index: 4

AI Analysis

For AI auditors and regulators, this establishes a hard statistical floor on calibration verification that worsens with model accuracy, challenging standard evaluation practices.

The paper proves a fundamental limit on calibration estimation accuracy, showing that as AI models improve, verifying their calibration becomes harder. Empirical validation across 27 benchmark-model pairs reveals that self-evaluation provides no information about calibration in 80% of cases, and 23% of pairwise comparisons are indistinguishable from noise.

The most cited calibration result in deep learning -- post-temperature-scaling ECE of 0.012 on CIFAR-100 (Guo et al., 2017) -- is below the statistical noise floor. We prove this is not a failure of the experiment but a law: the minimax rate for estimating calibration error with model error rate $\epsilon$ is $\Theta((L\epsilon/m)^{1/3})$, and no estimator can beat it. This "verification tax" implies that as AI models improve, verifying their calibration becomes fundamentally harder -- with the same exponent in opposite directions. We establish four results that contradict standard evaluation practice: (1) self-evaluation without labels provides exactly zero information about calibration, bounded by a constant independent of compute; (2) a sharp phase transition at $m\epsilon \approx 1$ below which miscalibration is undetectable; (3) active querying eliminates the Lipschitz constant, collapsing estimation to detection; (4) verification cost grows exponentially with pipeline depth at rate $L^K$. We validate across five benchmarks (MMLU, TruthfulQA, ARC-Challenge, HellaSwag, WinoGrande; ~27,000 items) with 6 LLMs from 5 families (8B-405B parameters, 27 benchmark-model pairs with logprob-based confidence), 95% bootstrap CIs, and permutation tests. Self-evaluation non-significance holds in 80% of pairs. Across frontier models, 23% of pairwise comparisons are indistinguishable from noise, implying that credible calibration claims must report verification floors and prioritize active querying once gains approach benchmark resolution.
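The scaling laws quoted in the abstract can be turned into rough arithmetic. The sketch below is illustrative only and is not the paper's code: it assumes that L is the Lipschitz constant, that m is the number of labeled evaluation items, that K is the pipeline depth, and that the unknown constant hidden in the Theta rate equals 1. All function names are hypothetical.

```python
import math


def verification_floor(lipschitz: float, error_rate: float, n_labels: int) -> float:
    """Rate-only floor (L * eps / m) ** (1/3).

    Assumption: the constant factor in the Theta(.) rate is taken to be 1,
    so the result is an order-of-magnitude guide, not an exact bound.
    """
    return (lipschitz * error_rate / n_labels) ** (1.0 / 3.0)


def labels_needed(lipschitz: float, error_rate: float, target: float) -> float:
    """Invert the rate: the m at which the floor equals `target` (same caveat)."""
    return lipschitz * error_rate / target ** 3


def below_detection_threshold(error_rate: float, n_labels: int) -> bool:
    """True when m * eps < 1, the regime the abstract calls undetectable."""
    return n_labels * error_rate < 1.0


def pipeline_cost_factor(lipschitz: float, depth: int) -> float:
    """Growth factor L ** K for a pipeline of depth K."""
    return lipschitz ** depth


if __name__ == "__main__":
    # Hypothetical inputs chosen for illustration, not taken from the paper.
    L, eps, m = 1.0, 0.008, 1000
    print("floor:", verification_floor(L, eps, m))
    print("labels for a 0.01 floor:", math.ceil(labels_needed(L, eps, 0.01)))
    print("undetectable regime:", below_detection_threshold(eps, m))
    print("depth-3 cost factor at L=2:", pipeline_cost_factor(2.0, 3))
```

Because the target enters `labels_needed` cubed, halving the target floor multiplies the required number of labels by eight under these assumptions.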
