CLAILGDec 12, 2025

Bounding Hallucinations: Information-Theoretic Guarantees for RAG Systems via Merlin-Arthur Protocols

arXiv:2512.11614v2h-index: 9
Originality Highly original
AI Analysis

This addresses the critical issue of unreliable outputs in RAG systems for users relying on AI-generated answers, offering a novel interactive-proof approach rather than incremental improvements.

The paper tackles the problem of hallucinations in retrieval-augmented generation (RAG) systems by introducing a training framework based on Merlin-Arthur protocols, which treats retrieval as verifiable evidence rather than a heuristic. The result is improved grounding in evidence, increased information-theoretic measures (soundness, completeness), and reduced hallucinations across multiple datasets and LLM families without manual annotations.

Retrieval-augmented generation (RAG) relies on retrieved context to guide large language models (LLM), yet treats retrieval as a weak heuristic rather than verifiable evidence -- leading to unsupported answers, hallucinations, and reliance on spurious context. We introduce a novel training framework that treats the RAG pipeline as an interactive proof system by adapting the Merlin-Arthur (M/A) protocol: Arthur (the generator LLM) trains on questions with unknown context provenance and Merlin gives helpful evidence, while Morgana injects adversarial, misleading context. Both use an XAI method to identify and modify evidence most influential to Arthur. This trains Arthur to (1) answer when evidence supports the answer, (2) reject when evidence is insufficient, and (3) rely on the context spans that truly ground the answer. We further introduce a verification framework that disentangles explanation fidelity from model predictive errors, and introduce the Explained Information Fraction (EIF), which normalizes M/A mutual-information guarantees. Across three RAG datasets and multiple LLM families and sizes, M/A training makes LLMs more grounded in evidence, increases information theoretic measures (soundness, completeness) and reject behavior with less hallucinations, without manually annotated unanswerable samples. Finally, the retriever also improves recall and MRR via automatically generated M/A hard positives and negatives. While high accuracy does not guarantee entropy flow from context to answer, our EIF results show that autonomous interactive-proof-style supervision enables RAG systems that treat retrieved documents as verifiable evidence. % rather than suggestions.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes