Concurrent Criterion Validation of a Validity Screen for LLM Confidence Signals via Selective Prediction

arXiv:2604.17716


AI Analysis

Provides a validated method to assess LLM confidence reliability for selective prediction, crucial for high-stakes applications.

The validity screen for LLM confidence signals predicts selective prediction performance: Valid models achieve a mean Type 2 AUROC of .624 versus .357 for Invalid models (Cohen's d = 2.81, p = .002), and the three-tier classification accounts for 47% of the variance in AUROC.

The validity screen (Cacioli, 2026d, 2026e) classifies LLM confidence signals as Valid, Indeterminate, or Invalid. We test whether these classifications predict selective prediction performance. Twenty frontier LLMs from seven families were evaluated on 524 items across six cognitive tracks. Valid models show mean Type 2 AUROC = .624 (SD = .048). Invalid models show mean AUROC = .357 (SD = .231). Cohen's d = 2.81, p = .002. The tiers order monotonically: Invalid (.357) < Indeterminate (.554) < Valid (.624). Split-half cross-validation yields median d = 1.77, P(d > 0) = 1.0 across 1,000 splits. The three-tier classification accounts for 47% of the variance in AUROC. DeepSeek-R1 drops from 85.3% accuracy at full coverage to 11.3% at 10% coverage. The screen predicts the criterion. For selective prediction, the screen matters.
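To make the reported quantities concrete, the following is a minimal sketch of how Type 2 AUROC, accuracy at a given coverage, and Cohen's d are commonly computed. It is not the paper's code. The function names and toy data are hypothetical, the AUROC uses the standard pairwise (Mann-Whitney) definition with ties counted as 0.5, and the pooled-SD form of Cohen's d is an assumption, since the abstract does not say which variant was used.

```python
from statistics import mean, stdev


def type2_auroc(confidence, correct):
    """Probability that a correct item gets higher confidence than an
    incorrect one (ties count as 0.5). 0.5 means no discrimination."""
    pos = [c for c, ok in zip(confidence, correct) if ok]
    neg = [c for c, ok in zip(confidence, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect item")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy_at_coverage(confidence, correct, coverage):
    """Accuracy on the top `coverage` fraction of items, ranked by confidence."""
    k = max(1, round(coverage * len(confidence)))
    ranked = sorted(zip(confidence, correct), key=lambda t: t[0], reverse=True)
    return sum(ok for _, ok in ranked[:k]) / k


def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation (assumed variant)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2
                  + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5


# Toy data (hypothetical): 1 = correct answer, 0 = incorrect answer.
correct = [1, 1, 0, 1, 0, 0]
informative = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]  # confidence tracks correctness
inverted = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7]     # confidence is anti-correlated

for name, conf in [("informative", informative), ("inverted", inverted)]:
    print(name,
          round(type2_auroc(conf, correct), 3),
          round(accuracy_at_coverage(conf, correct, 1.0), 3),
          round(accuracy_at_coverage(conf, correct, 0.5), 3))
```

On the toy data, the informative signal has an AUROC above 0.5 and accuracy rises as coverage shrinks, while the inverted signal has an AUROC below 0.5 and accuracy falls. The second pattern is the one the abstract reports for DeepSeek-R1.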
