Hallucination as output-boundary misclassification: a composite abstention architecture for language models
This work addresses hallucination in language models for users who rely on accurate outputs. The contribution is incremental, since it builds on existing methods such as prompting and gating.
The paper addresses the problem of large language models producing unsupported claims by framing it as a misclassification error and proposing a composite architecture that combines instruction-based refusal with a structural abstention gate. In evaluations across 50 items and three models, the composite approach achieved high overall accuracy with low hallucination, though it inherited some over-abstention from the instruction component.
Large language models often produce unsupported claims. We frame this as a misclassification error at the output boundary, where internally generated completions are emitted as if they were grounded in evidence. This motivates a composite intervention that combines instruction-based refusal with a structural abstention gate. The gate computes a support deficit score, St, from three black-box signals: self-consistency (At), paraphrase stability (Pt), and citation coverage (Ct), and blocks output when St exceeds a threshold. In a controlled evaluation across 50 items, five epistemic regimes, and three models, neither mechanism alone was sufficient. Instruction-only prompting reduced hallucination sharply, but still showed over-cautious abstention on answerable items and residual hallucination for GPT-3.5-turbo. The structural gate preserved answerable accuracy across models but missed confident confabulation on conflicting-evidence items. The composite architecture achieved high overall accuracy with low hallucination, while also inheriting some over-abstention from the instruction component. A supplementary 100-item no-context stress test derived from TruthfulQA showed that structural gating provides a capability-independent abstention floor. Overall, instruction-based refusal and structural gating show complementary failure modes, which suggests that effective hallucination control benefits from combining both mechanisms.
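The gating rule described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the three signals are combined into the support deficit score, so everything specific here is an assumption made for illustration: each signal is taken to lie in [0, 1] with 1 meaning fully supported, the score is a weighted mean of the per-signal deficits, and the names `Signals`, `support_deficit`, `gate`, and `ABSTAIN`, the equal weights, and the 0.5 threshold are all hypothetical rather than the authors' implementation.

```python
from dataclasses import dataclass

# Hypothetical sentinel returned when the gate blocks an answer.
ABSTAIN = "[ABSTAIN]"


@dataclass(frozen=True)
class Signals:
    """Black-box support signals, each ASSUMED to lie in [0, 1] (1 = well supported)."""
    self_consistency: float      # At in the abstract
    paraphrase_stability: float  # Pt in the abstract
    citation_coverage: float     # Ct in the abstract


def support_deficit(signals, weights=(1.0, 1.0, 1.0)):
    """Combine the signals into a deficit score in [0, 1].

    ASSUMPTION: the score is a weighted mean of (1 - signal). The source only
    states that the score is computed from the three signals, not how.
    """
    values = (
        signals.self_consistency,
        signals.paraphrase_stability,
        signals.citation_coverage,
    )
    if any(not 0.0 <= v <= 1.0 for v in values):
        raise ValueError("each signal must lie in [0, 1]")
    if any(w < 0.0 for w in weights) or sum(weights) <= 0.0:
        raise ValueError("weights must be non-negative with a positive sum")
    weighted = sum(w * (1.0 - v) for w, v in zip(weights, values))
    return weighted / sum(weights)


def gate(answer, signals, threshold=0.5):
    """Emit the answer unless the support deficit exceeds the threshold.

    The 0.5 default is a placeholder, not a value taken from the paper.
    """
    if support_deficit(signals) > threshold:
        return ABSTAIN
    return answer
```

Under these assumptions, a candidate answer with uniformly high signals passes through unchanged, while one with low self-consistency and no citation coverage is replaced by the abstention sentinel. A weighted mean cannot catch a completion that scores well on all three signals, which is consistent with the reported failure mode of the gate on confident confabulation.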