CL | AI | May 13

Where Does Reasoning Break? Step-Level Hallucination Detection via Hidden-State Transport Geometry

arXiv:2605.13772 | 36.7
AI Analysis

For practitioners needing to localize reasoning errors in LLMs, this work offers a novel trajectory-based detection approach but reveals a critical deployment obstacle due to distribution shift.

The paper proposes a method that detects the first error in multi-step LLM reasoning by analyzing the geometry of the hidden-state trajectory. Both of its models outperform baselines in-domain on ProcessBench, PRM800K, HaluEval, and TruthfulQA, but the distilled student model fails under distribution shift.

Large language models hallucinate during multi-step reasoning, but most existing detectors operate at the trace level: they assign one confidence score to a full output, fail to localize the first error, and often require multiple sampled completions. We frame hallucination instead as a property of the hidden-state trajectory produced during a single forward pass. Correct reasoning moves through a stable manifold of locally coherent transitions; a first error appears as a localized excursion in transport cost away from this manifold. We operationalize this view with a label-conditioned teacher that builds a trace-specific contrastive PCA lens and scores each step with seven geometric transition features, and a deployable BiLSTM student distilled from the teacher that operates on raw hidden states without inference-time labels. We prove that contrastive PCA is the optimal projection for a transport-separation objective between first error and correct states, and that single-pass first error localization holds whenever the first error creates a positive transport margin over preceding correct transitions. On ProcessBench, PRM800K, HaluEval, and TruthfulQA, both models outperform entropy-based, probing-based, and attention-based baselines in-domain; the teacher transfers stably across language models and datasets, while the student collapses under shift, a gap our distillation theory predicts. These results recast step-level hallucination detection as a problem of trajectory dynamics and identify the central obstacle to deployment: preserving the contrastive transport margin under distribution shift.
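The abstract's two core ideas, a contrastive PCA projection and first-error localization through a transport margin, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the textbook form of contrastive PCA (top eigenvectors of the foreground covariance minus `alpha` times the background covariance), uses the Euclidean length of each projected step-to-step transition as a stand-in for transport cost (only one feature, not the seven the paper describes), and flags the first transition whose cost exceeds every earlier one by a fixed `margin`. The function names, `alpha`, `margin`, and the toy data are all hypothetical.

```python
import numpy as np

def contrastive_pca(foreground, background, alpha=1.0, k=1):
    """Return the top-k directions of C_fg - alpha * C_bg (textbook contrastive PCA).

    foreground: states around first errors, shape (n, d)  [assumed role]
    background: states from correct steps, shape (m, d)   [assumed role]
    """
    c_fg = np.cov(foreground, rowvar=False)
    c_bg = np.cov(background, rowvar=False)
    # eigh returns eigenvalues in ascending order for a symmetric matrix.
    _, evecs = np.linalg.eigh(c_fg - alpha * c_bg)
    return evecs[:, ::-1][:, :k]  # largest eigenvalues first

def transport_costs(hidden, proj):
    """Length of each consecutive transition after projection (a proxy cost)."""
    z = hidden @ proj
    return np.linalg.norm(np.diff(z, axis=0), axis=1)

def first_error_transition(costs, margin):
    """Index of the first transition whose cost beats all earlier ones by margin."""
    for t in range(1, len(costs)):
        if costs[t] > np.max(costs[:t]) + margin:
            return t
    return None  # no transition clears the margin

# Toy data: correct states vary only along axis 0; error states also vary on axis 2.
background = np.array([[-2, 0, 0], [-1, 0, 0], [0, 0, 0], [1, 0, 0], [2, 0, 0]], dtype=float)
foreground = np.array([[-2, 0, 1], [-1, 0, -1], [0, 0, 0], [1, 0, -1], [2, 0, 1]], dtype=float)
proj = contrastive_pca(foreground, background, alpha=1.0, k=1)

# One reasoning trace of five step states; the jump on axis 2 happens at transition 2.
trace = np.array([[0, 0, 0], [1, 0, 0], [2, 0, 0], [3, 0, 2], [4, 0, 2]], dtype=float)
costs = transport_costs(trace, proj)
print(costs, first_error_transition(costs, margin=0.5))
```

In this toy case the projection discards the shared axis-0 drift, so ordinary progress between steps costs nothing and only the off-manifold jump registers. The fixed `margin` is a simplification chosen for readability; the paper's condition is stated in terms of a positive transport margin over preceding correct transitions, and its exact form is not given in the abstract.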

