The Phenomenology of Hallucinations

arXiv:2603.13911 · 60.1 · h-index: 19

Predicted impact: top 44% in AI · last 90 days · Originality: Highly original

AI Analysis

The paper addresses hallucination in language models, a critical issue for AI safety and reliability, and offers a mechanistic explanation rather than an incremental improvement.

The paper asks why language models hallucinate and finds that the cause is not a failure to detect uncertainty but weak integration of that uncertainty into output generation. Uncertain inputs are reliably identified and occupy regions with 2-3 times the intrinsic dimensionality of factual inputs, yet the resulting internal signal is geometrically amplified while remaining functionally silent, so the model hallucinates despite detecting its own uncertainty.
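To make "intrinsic dimensionality" concrete, here is a minimal sketch using the participation ratio of covariance eigenvalues. The summary does not say which estimator the paper uses, so the participation ratio is a stand-in, and the two point clouds are synthetic. Their 4 and 12 latent dimensions are hypothetical values chosen by construction, not results from the paper.

```python
import numpy as np

def participation_ratio(X):
    """Effective dimensionality (sum of eigenvalues)^2 / (sum of squared eigenvalues)."""
    Xc = X - X.mean(axis=0, keepdims=True)       # center the point cloud
    s = np.linalg.svd(Xc, compute_uv=False)      # singular values of the centered data
    lam = s**2 / (X.shape[0] - 1)                # covariance eigenvalues
    return float(lam.sum() ** 2 / (lam**2).sum())

rng = np.random.default_rng(0)
n, d = 2000, 64  # hypothetical sample count and hidden size

def synthetic_activations(k):
    """Synthetic cloud: k isotropic latent dimensions embedded in d, plus small noise."""
    Z = rng.standard_normal((n, k))
    basis = np.linalg.qr(rng.standard_normal((d, k)))[0]  # d x k orthonormal columns
    return Z @ basis.T + 0.01 * rng.standard_normal((n, d))

factual = synthetic_activations(4)     # stand-in for "factual" inputs
uncertain = synthetic_activations(12)  # stand-in for "uncertain" inputs

pr_factual = participation_ratio(factual)
pr_uncertain = participation_ratio(uncertain)
print("factual:", pr_factual, "uncertain:", pr_uncertain,
      "ratio:", pr_uncertain / pr_factual)
```

On real models the inputs would be hidden-state activations collected per prompt class. The estimator choice matters, and nearest-neighbor estimators can give different absolute values than this variance-based one.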

We show that language models hallucinate not because they fail to detect uncertainty, but because of a failure to integrate it into output generation. Across architectures, uncertain inputs are reliably identified, occupying high-dimensional regions with 2-3× the intrinsic dimensionality of factual inputs. However, this internal signal is weakly coupled to the output layer: uncertainty migrates into low-sensitivity subspaces, becoming geometrically amplified yet functionally silent. Topological analysis shows that uncertainty representations fragment rather than converging to a unified abstention state, while gradient and Fisher probes reveal collapsing sensitivity along the uncertainty direction. Because cross-entropy training provides no attractor for abstention and uniformly rewards confident prediction, associative mechanisms amplify these fractured activations until residual coupling forces a committed output despite internal detection. Causal interventions confirm this account by restoring refusal when uncertainty is directly connected to logits.
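The idea of a signal that is "geometrically amplified yet functionally silent" can be illustrated with a toy linear readout. This is a sketch under stated assumptions, not the paper's experimental setup. The readout `W`, the uncertainty direction `u`, the coupling strength `1e-3`, the extra `REFUSE` token, and the gain `alpha` are all hypothetical. It covers only a gradient-style sensitivity probe and a simple logit intervention, and leaves out the Fisher and topological analyses.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 16, 5   # hypothetical hidden size and vocabulary size
REFUSE = vocab     # index of a hypothetical extra abstention token

# Hypothetical unit "uncertainty direction" in hidden space.
u = rng.standard_normal(d)
u /= np.linalg.norm(u)

# Readout whose rows are orthogonal to u, plus a tiny residual coupling to u.
W_perp = rng.standard_normal((vocab, d))
W_perp -= np.outer(W_perp @ u, u)
W = W_perp + 1e-3 * np.outer(rng.standard_normal(vocab), u)

def logit_sensitivity(v):
    """Norm of the logit change per unit step of the hidden state along v."""
    return float(np.linalg.norm(W @ v))

def predict(h, alpha=None):
    """Argmax token; if alpha is given, append a refusal logit alpha * (u . h)."""
    logits = W @ h
    if alpha is not None:
        logits = np.append(logits, alpha * (u @ h))
    return int(np.argmax(logits))

# A content vector aligned with token 0's readout row, and an "uncertain" copy
# that carries a large component along u.
content_dir = W_perp[0] / np.linalg.norm(W_perp[0])
h_confident = 3.0 * content_dir
h_uncertain = h_confident + 10.0 * u

# Largest logit change caused by the uncertainty component alone.
shift = float(np.max(np.abs(W @ h_uncertain - W @ h_confident)))

print("sensitivity along u:", logit_sensitivity(u))
print("sensitivity along content direction:", logit_sensitivity(content_dir))
print("max logit shift from uncertainty:", shift)
print("uncertain input, no intervention:", predict(h_uncertain))
print("uncertain input, with intervention:", predict(h_uncertain, alpha=50.0))
```

In this toy, the uncertainty component is large in hidden space but barely moves the logits, so the model still commits to a vocabulary token. Connecting `u` directly to a refusal logit makes the abstention token win for the uncertain input while leaving the confident input's prediction among the ordinary tokens.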
