CL · AI · May 8

Sanity Checks for Long-Form Hallucination Detection

Geigh Zollicoffer, Minh Vu, Hongli Zhan, Raymond Li, Manish Bhattarai

arXiv:2605.08346

AI Analysis

For researchers developing hallucination detection methods for LLMs, this work provides sanity checks to ensure evaluations measure reasoning validity rather than surface correlates.

The paper introduces a controlled-invariance methodology to distinguish whether hallucination detection methods for LLMs evaluate reasoning traces or exploit answer-level artifacts. It shows that a lightweight scorer (TRACT) using lexical trajectory features achieves strong robustness and competitive performance, suggesting the main challenge is isolating reasoning signal from endpoint cues.

Hallucination detection methods for large language models increasingly operate on chain-of-thought reasoning traces, yet it remains unclear whether they evaluate the reasoning itself or merely exploit surface correlates of the final answer. We introduce a controlled-invariance methodology that exposes this distinction through two oracle tests: Force, which replaces each response's final answer with the ground truth while preserving the reasoning trace, and Remove, which strips answer-announcement steps while leaving the trajectory intact. These tests reveal whether a detector's predictive power derives from answer-level artifacts rather than from the structure or validity of intermediate reasoning. We further show that once these artifacts are controlled for, effective detection does not necessarily require complex learned representations: TRACT, a lightweight scorer built on lexical trajectory features (hedging trends, step-length dynamics, and cross-response vocabulary convergence), achieves strong robustness while remaining competitive with or outperforming existing baselines on unperturbed traces. These findings suggest that the current central challenge in reasoning-aware hallucination detection is not the absence of signal in the trace, but the failure to isolate it from endpoint cues.
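
To make the two oracle tests and one of the lexical trajectory features concrete, here is a minimal Python sketch. It is not the authors' implementation. The answer-step pattern, the hedge-word list, and the function names (split_steps, force, remove, hedging_trend) are all assumptions chosen for illustration, since the abstract does not specify how answer-announcement steps are identified or how hedging trends are computed.

```python
import re
from typing import List

# Hypothetical rule for spotting answer-announcement steps; the abstract
# does not say how the paper identifies them.
ANSWER_STEP = re.compile(r"^(final answer|the answer is|answer\s*:)", re.IGNORECASE)

# Hypothetical hedge lexicon, for illustration only.
HEDGES = {"maybe", "perhaps", "possibly", "might", "probably", "likely", "unsure"}


def split_steps(trace: str) -> List[str]:
    """Treat each non-empty line of a trace as one reasoning step."""
    return [line.strip() for line in trace.split("\n") if line.strip()]


def remove(steps: List[str]) -> List[str]:
    """Remove-style perturbation: drop answer-announcement steps, keep the rest."""
    return [s for s in steps if not ANSWER_STEP.match(s)]


def force(steps: List[str], ground_truth: str) -> List[str]:
    """Force-style perturbation: keep the reasoning steps unchanged and
    announce the ground-truth answer in place of the model's own answer."""
    return remove(steps) + [f"Final answer: {ground_truth}"]


def hedging_trend(steps: List[str]) -> float:
    """Least-squares slope of the per-step hedge-word rate against step index.
    A negative slope means hedging decreases as the trace progresses."""
    rates = []
    for step in steps:
        tokens = re.findall(r"[a-z']+", step.lower())
        hedge_count = sum(token in HEDGES for token in tokens)
        rates.append(hedge_count / max(len(tokens), 1))
    n = len(rates)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(rates) / n
    numerator = sum((i - mean_x) * (r - mean_y) for i, r in enumerate(rates))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


trace = (
    "Paris might be the capital.\n"
    "France's capital is Paris.\n"
    "Final answer: Lyon"
)
steps = split_steps(trace)
print(remove(steps))
print(force(steps, "Paris"))
print(hedging_trend(steps))
```

Under this sketch, a detector that judges the reasoning itself should score the original trace and its Remove variant similarly, because the reasoning steps are identical. A large shift in score between the original and the Force variant would instead suggest the detector is reading the final answer.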
