CL · LG · Oct 26, 2025

Interpreting and Mitigating Unwanted Uncertainty in LLMs

arXiv:2510.22866v1 · 1 citation · h-index: 3
Originality: Incremental advance
AI Analysis

This work addresses trust and risk issues for LLM users in high-stakes domains, though it is an incremental advance because it builds on existing interpretability frameworks.

The authors tackled the problem of unwanted uncertainty in Large Language Models, where models flip correct answers to incorrect ones upon re-prompting, and found that masking specific non-retrieval attention heads reduces flip behavior by up to 15% without causing incoherence.
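As a rough illustration of how flip behavior could be quantified, the sketch below computes the share of initially correct answers that turn incorrect after a re-evaluation prompt. The `Trial` record and the `flip_rate` function are hypothetical names introduced here, and the paper's exact metric may be defined differently.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    # Hypothetical record of one question asked twice.
    first_correct: bool   # answer before the re-evaluation prompt was correct
    second_correct: bool  # answer after the re-evaluation prompt was correct


def flip_rate(trials):
    """Fraction of initially correct answers that become incorrect on re-prompting.

    Assumption: only trials whose first answer was correct are eligible to flip.
    """
    eligible = [t for t in trials if t.first_correct]
    if not eligible:
        return 0.0
    flips = sum(1 for t in eligible if not t.second_correct)
    return flips / len(eligible)


trials = [
    Trial(True, True),
    Trial(True, False),
    Trial(True, False),
    Trial(False, False),  # initially wrong, so not counted as a flip candidate
    Trial(True, True),
]
print(flip_rate(trials))  # 2 of the 4 initially correct answers flipped
```

Comparing this number before and after an intervention gives one way to read a reported reduction in flip behavior, assuming the same question set is used in both runs.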

Despite their impressive capabilities, Large Language Models (LLMs) exhibit unwanted uncertainty, a phenomenon where a model changes a previously correct answer into an incorrect one when re-prompted. This behavior undermines trust and poses serious risks in high-stakes domains. In this work, we investigate the mechanisms that drive this phenomenon. We adapt the Needle-in-a-Haystack retrieval framework and integrate a Flip-style re-evaluation prompt to simulate realistic answer-flipping scenarios. We find that retrieval heads are not primarily responsible for avoiding uncertainty. Instead, we identify a small set of non-retrieval attention heads that disproportionately attend to misleading tokens in uncertain contexts. Masking these heads yields significant improvements, reducing flip behavior by up to 15% without introducing incoherence or overcorrection. However, when these masked models are tested on downstream tasks, we observe trade-offs between flip reduction and task performance. Our findings contribute to the growing field of mechanistic interpretability and present a simple yet effective technique for mitigating uncertainty-driven failure modes in LLMs.
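The head-selection and masking idea in the abstract can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: the `(layer, head)` keys, the ranking by summed attention weight on misleading tokens, the top-`k` cutoff, and zeroing a head's output vector are all hypothetical choices made here, and the paper may use different criteria or a different ablation method.

```python
def misleading_mass(attn_row, misleading_positions):
    """Total attention weight one head places on misleading token positions."""
    return sum(attn_row[i] for i in misleading_positions)


def select_heads_to_mask(attn, misleading_positions, retrieval_heads, k):
    """Rank non-retrieval heads by attention on misleading tokens; return the top k.

    attn maps a hypothetical (layer, head) key to that head's attention
    weights over the context tokens.
    """
    scores = {
        key: misleading_mass(row, misleading_positions)
        for key, row in attn.items()
        if key not in retrieval_heads  # retrieval heads are left alone
    }
    ranked = sorted(scores, key=lambda key: scores[key], reverse=True)
    return set(ranked[:k])


def apply_head_mask(head_outputs, masked_heads):
    """Zero the output vector of each masked head; copy the others unchanged."""
    return {
        key: [0.0] * len(vec) if key in masked_heads else list(vec)
        for key, vec in head_outputs.items()
    }


# Toy data: 2 layers x 2 heads, 4 context tokens, tokens 2 and 3 are misleading.
attn = {
    (0, 0): [0.7, 0.1, 0.1, 0.1],      # treated as a retrieval head below
    (0, 1): [0.1, 0.1, 0.2, 0.6],
    (1, 0): [0.25, 0.25, 0.25, 0.25],
    (1, 1): [0.05, 0.05, 0.5, 0.4],
}
head_outputs = {key: [1.0, 2.0] for key in attn}

masked = select_heads_to_mask(attn, {2, 3}, retrieval_heads={(0, 0)}, k=1)
masked_outputs = apply_head_mask(head_outputs, masked)
print(masked)
print(masked_outputs[(1, 1)])
```

In a real model the mask would be applied inside the attention layer before the output projection, and the attention statistics would be averaged over many flip-inducing prompts rather than read from a single example.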
