IR AI Jul 23, 2025

VERIRAG: Healthcare Claim Verification via Statistical Audit in Retrieval-Augmented Generation

Shubham Mohole, Hongjun Choi, Shusen Liu, Christine Klymko, Shashank Kushwaha, Derek Shi, Wesam Sakla, Sainyam Galhotra, Ruben Glatt

arXiv:2507.17948v16.31 citations h-index: 19

Originality: Incremental advance

AI Analysis

This work addresses the need for reliable evidence vetting in clinical decision support, but the advance is incremental because it builds on existing RAG methods.

The paper addresses healthcare claim verification, targeting the lack of scientific quality assessment in retrieval-augmented generation systems. It introduces VERIRAG, which combines an 11-point methodological checklist, a quality-weighted evidence score, and a dynamic acceptance threshold. VERIRAG achieves F1 scores of 0.53 to 0.65, a 10 to 14 point improvement over the next-best method on each dataset.

Retrieval-augmented generation (RAG) systems are increasingly adopted in clinical decision support, yet they remain methodologically blind: they retrieve evidence but cannot vet its scientific quality. A paper claiming "Antioxidant proteins decreased after alloferon treatment" and a rigorous multi-laboratory replication study will be treated as equally credible, even if the former lacked scientific rigor or was even retracted. To address this challenge, we introduce VERIRAG, a framework that makes three notable contributions: (i) the Veritable, an 11-point checklist that evaluates each source for methodological rigor, including data integrity and statistical validity; (ii) a Hard-to-Vary (HV) Score, a quantitative aggregator that weights evidence by its quality and diversity; and (iii) a Dynamic Acceptance Threshold, which calibrates the required evidence based on how extraordinary a claim is. Across four datasets (comprising retracted, conflicting, comprehensive, and settled science corpora), the VERIRAG approach consistently outperforms all baselines, achieving absolute F1 scores ranging from 0.53 to 0.65, representing a 10 to 14 point improvement over the next-best method in each respective dataset. We will release all materials necessary for reproducing our results.
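
The abstract names three components but gives no formulas, so the following is only a toy sketch of how a checklist score, a quality-and-diversity aggregator, and a claim-dependent threshold could fit together. Everything here is an assumption made for illustration, not the authors' method: the `Source` fields, the quality-weighted support share, the diversity ratio, the linear threshold, and the `base` and `ceiling` values are all hypothetical. Only the checklist size of 11 comes from the abstract.

```python
from dataclasses import dataclass
from typing import List

CHECKLIST_SIZE = 11  # from the abstract: an 11-point checklist


@dataclass
class Source:
    """One retrieved document with its audit results (all fields hypothetical)."""
    checks_passed: int    # number of checklist items the source passed (0..11)
    supports_claim: bool  # True if the source supports the claim, False if it refutes it
    study_type: str       # coarse label used here as a stand-in for evidence diversity


def quality(source: Source) -> float:
    """Fraction of checklist items passed (assumed equal weighting)."""
    return source.checks_passed / CHECKLIST_SIZE


def hv_score(sources: List[Source]) -> float:
    """Toy aggregator: quality-weighted support share, scaled by diversity.

    This formula is an illustrative assumption, not the paper's HV Score.
    """
    total = sum(quality(s) for s in sources)
    if total == 0:
        return 0.0
    support = sum(quality(s) for s in sources if s.supports_claim)
    all_types = {s.study_type for s in sources}
    supporting_types = {s.study_type for s in sources if s.supports_claim}
    diversity = len(supporting_types) / len(all_types)
    return (support / total) * diversity


def acceptance_threshold(extraordinariness: float,
                         base: float = 0.5,
                         ceiling: float = 0.9) -> float:
    """Assumed linear rule: more extraordinary claims need a higher score."""
    e = min(max(extraordinariness, 0.0), 1.0)  # clamp to [0, 1]
    return base + (ceiling - base) * e


def accept(sources: List[Source], extraordinariness: float) -> bool:
    """Accept the claim when the aggregate score meets the threshold."""
    return hv_score(sources) >= acceptance_threshold(extraordinariness)


if __name__ == "__main__":
    strong = [Source(11, True, "rct"), Source(11, True, "cohort")]
    weak = [Source(2, True, "case_report"), Source(11, False, "rct")]
    print(accept(strong, extraordinariness=1.0))
    print(accept(weak, extraordinariness=0.0))
```

In this sketch a single low-quality supporting paper cannot outweigh a rigorous refuting study, which mirrors the abstract's example of a weak or retracted source being treated differently from a careful replication.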
