CL · LG · Apr 8

When Should a Language Model Trust Itself? Same-Model Self-Verification as a Conditional Confidence Signal

arXiv:2605.02915 · 15.8
Predicted impact: top 30% in CL · last 90 days
Originality: Synthesis-oriented
AI Analysis

For researchers building selective prediction systems, this work clarifies that self-verification is not a general-purpose uncertainty estimator but a conditional signal whose value depends on task, model, and baseline.

The paper evaluates same-model self-verification as a confidence signal for selective prediction, comparing it to likelihood-based baselines (LL-AVG, LL-SUM) on ARC-Challenge and TruthfulQA-MC. Results are task- and model-dependent: self-verification improves over LL-AVG on ARC-Challenge for some models (e.g., Qwen-7B), but underperforms on TruthfulQA-MC, where LL-SUM often remains stronger.
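The summary above names LL-AVG and LL-SUM without defining them. The sketch below assumes the common reading: LL-SUM is the sum of the token log-probabilities of a candidate answer, and LL-AVG is that sum divided by the number of tokens. The function names, the `pick` helper, and the toy log-probabilities are hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict, Sequence


def ll_sum(token_logprobs: Sequence[float]) -> float:
    """Assumed LL-SUM: total log-likelihood of the answer tokens."""
    return float(sum(token_logprobs))


def ll_avg(token_logprobs: Sequence[float]) -> float:
    """Assumed LL-AVG: length-normalized log-likelihood of the answer tokens."""
    if len(token_logprobs) == 0:
        raise ValueError("need at least one token log-probability")
    return ll_sum(token_logprobs) / len(token_logprobs)


def pick(choices: Dict[str, Sequence[float]],
         score: Callable[[Sequence[float]], float]) -> str:
    """Return the label of the highest-scoring multiple-choice option."""
    return max(choices, key=lambda label: score(choices[label]))


# Toy, made-up per-token log-probabilities for two answer options.
toy_choices = {
    "short": [-0.9, -0.8],
    "long": [-0.5, -0.5, -0.5, -0.5],
}
print(pick(toy_choices, ll_sum), pick(toy_choices, ll_avg))
```

Under these assumed definitions the two baselines can disagree on the same item: the sum penalizes longer answers, while the average does not. That may be one reason they behave differently across tasks.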

Same-model self-verification, prompting a model to audit its own predicted answer, is a plausible confidence signal for selective prediction, but its practical value remains unclear once strong likelihood-based baselines are taken seriously. We evaluate self-verification against two such baselines, LL-AVG and LL-SUM, on ARC-Challenge and TruthfulQA-MC across multiple model families, scales, and prompt variants. We measure not only correctness ranking, but also abstention quality through AURC and operating-point analyses. The results are sharply task- and model-dependent. On ARC-Challenge, self-verification substantially improves over LL-AVG for Phi-2 and the Qwen models, with the largest gains appearing in Qwen-7B. On TruthfulQA-MC, however, the signal is less reliable: smaller models can become prompt-sensitive, DeepSeek-R1-Distill-8B degrades relative to LL-AVG, and LL-SUM often remains the stronger practical baseline. We therefore do not treat self-verification as a general-purpose uncertainty estimator. In this setting, it is better understood as a conditional confidence signal whose value depends on task type, model family, prompt formulation, and, crucially, the baseline it must beat.
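The abstract reports abstention quality through AURC (area under the risk-coverage curve). As a rough illustration, the sketch below uses one standard discrete formulation: rank predictions by confidence, compute the error rate among the top-k most confident predictions for each k, and average those risks. The paper's exact estimator is not given here, so treat this as an assumption. The function names and toy inputs are hypothetical.

```python
from typing import List, Sequence, Tuple


def risk_coverage_curve(confidences: Sequence[float],
                        correct: Sequence[bool]) -> List[Tuple[float, float]]:
    """Return (coverage, risk) points, answering the most confident items first."""
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(confidences)
    # Sort indices by confidence, highest first (ties keep input order).
    order = sorted(range(n), key=lambda i: confidences[i], reverse=True)
    errors = 0
    curve = []
    for k, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        curve.append((k / n, errors / k))  # risk = error rate among answered items
    return curve


def aurc(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean risk over all coverage levels; lower is better."""
    curve = risk_coverage_curve(confidences, correct)
    return sum(risk for _, risk in curve) / len(curve)


# Toy example: a signal that ranks correct answers first gets a lower AURC.
good = aurc([0.9, 0.8, 0.3, 0.1], [True, True, False, False])
bad = aurc([0.1, 0.3, 0.8, 0.9], [True, True, False, False])
print(good < bad)
```

Any confidence signal, whether a self-verification score or a likelihood baseline, can be passed in as `confidences`, so the same routine allows a like-for-like comparison on one set of predictions.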

