LG | Jul 31, 2025

A Single Direction of Truth: An Observer Model's Linear Residual Probe Exposes and Steers Contextual Hallucinations

Charles O'Neill, Slava Chalnev, Chi Chi Zhao, Max Kirkby, Mudith Jayasekara

arXiv:2507.23221v1 | 15.76 citations | h-index: 2

Originality: Incremental advance

AI Analysis

This paper addresses unreliable text generation in AI systems and offers a practical interpretability method for detecting and mitigating hallucinations. The advance is incremental because it builds on existing probe techniques.

The paper tackles contextual hallucinations with a generator-agnostic observer model: a linear probe on the observer's residual stream detects hallucinated text and outperforms baselines by 5-27 points. Manipulating the direction found by the probe causally changes generator hallucination rates.
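To make the probe-and-steer idea concrete, here is a minimal sketch on synthetic data. It is not the authors' code. It assumes the probe is a logistic-regression classifier, that "residual activations" can be stood in for by random vectors whose two classes differ along one hidden direction, and that steering means adding a multiple of the learned unit direction. All names (`fit_linear_probe`, `steer`, `d_model`, `alpha`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16  # toy residual-stream width (assumption)
n = 200       # examples per class (assumption)

# Synthetic stand-ins for observer activations: the hallucinated class is
# shifted along one hidden unit direction relative to the faithful class.
true_dir = rng.normal(size=d_model)
true_dir /= np.linalg.norm(true_dir)
faithful = rng.normal(size=(n, d_model))
hallucinated = rng.normal(size=(n, d_model)) + 3.0 * true_dir

X = np.vstack([faithful, hallucinated])
y = np.concatenate([np.zeros(n), np.ones(n)])  # 1 = hallucinated


def fit_linear_probe(X, y, lr=0.5, steps=1000):
    """Logistic regression fitted by plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(hallucinated)
        err = p - y                              # gradient of the log-loss
        w -= lr * (X.T @ err) / len(y)
        b -= lr * err.mean()
    return w, b


def steer(h, direction, alpha):
    """Shift activations by alpha along a unit direction."""
    return h + alpha * direction


w, b = fit_linear_probe(X, y)
direction = w / np.linalg.norm(w)  # the single learned linear direction
accuracy = ((X @ w + b > 0) == (y == 1)).mean()
cosine = float(direction @ true_dir)

# Push hallucinated examples against the direction and re-score them.
steered = steer(hallucinated, direction, alpha=-3.0)
frac_before = (hallucinated @ w + b > 0).mean()
frac_after = (steered @ w + b > 0).mean()

print(f"train accuracy={accuracy:.2f}, cosine to hidden direction={cosine:.2f}")
print(f"flagged as hallucinated: before={frac_before:.2f}, after={frac_after:.2f}")
```

Re-scoring the shifted vectors with the same probe only illustrates the geometry. The causal claim in the paper concerns the generator's actual hallucination rate, which this toy does not measure.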

Contextual hallucinations -- statements unsupported by given context -- remain a significant challenge in AI. We demonstrate a practical interpretability insight: a generator-agnostic observer model detects hallucinations via a single forward pass and a linear probe on its residual stream. This probe isolates a single, transferable linear direction separating hallucinated from faithful text, outperforming baselines by 5-27 points and showing robust mid-layer performance across Gemma-2 models (2B to 27B). Gradient-times-activation localises this signal to sparse, late-layer MLP activity. Critically, manipulating this direction causally steers generator hallucination rates, proving its actionability. Our results offer novel evidence of internal, low-dimensional hallucination tracking linked to specific MLP sub-circuits, exploitable for detection and mitigation. We release the 2000-example ContraTales benchmark for realistic assessment of such solutions.
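As a rough illustration of gradient-times-activation attribution, the sketch below scores each hidden unit of a toy MLP by its activation multiplied by the gradient of a linear probe readout with respect to that activation. The assumptions are loud: a single ReLU MLP with random weights (Gemma-2 uses a gated MLP), a random stand-in probe vector, and a readout of the MLP output alone with no residual skip path. The names `W_in`, `W_out`, `probe_w`, and the sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_mlp = 8, 32  # toy sizes (assumptions)

# Random stand-ins for MLP weights, a probe direction, and a residual input.
W_in = rng.normal(size=(d_mlp, d_model)) / np.sqrt(d_model)
W_out = rng.normal(size=(d_model, d_mlp)) / np.sqrt(d_mlp)
probe_w = rng.normal(size=d_model)
x = rng.normal(size=d_model)

act = np.maximum(W_in @ x, 0.0)   # ReLU hidden activations
score = probe_w @ (W_out @ act)   # probe readout of the MLP output

# The readout is linear in act, so d(score)/d(act) is W_out^T probe_w.
grad = W_out.T @ probe_w
attribution = grad * act          # gradient-times-activation per hidden unit

# Units ranked by absolute attribution; a sparse signal concentrates here.
top_units = np.argsort(-np.abs(attribution))[:5]
print("top units:", top_units)
print("sum of attributions:", attribution.sum(), "score:", score)
```

Because the readout is linear in the hidden activations, the attributions sum exactly to the score in this toy. In a real network the attribution is a first-order approximation.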
