Claim-Selective Certification for High-Risk Medical Retrieval-Augmented Generation
This paper addresses the problem of evaluating high-risk medical QA systems under mixed evidence, providing a more granular certification than a binary answer-or-abstain decision.
The paper introduces claim-selective certification for medical RAG systems: responses are decomposed into verifiable claims, scored against retrieved evidence, and mapped to one of four actions (full, partial, conflict, abstain). On the primary weak-label protocol, the system achieves UCCR=0.0000, PAU=1.0000, and action accuracy=0.9204 on dev (n=314), and UCCR=0.0000, PAU=0.9967, and action accuracy=0.8997 on test (n=319).
Medical RAG systems in high-risk QA settings are often evaluated through a single answer-or-abstain decision, but mixed evidence may support one claim, require conditions for another, and contradict a third. We study claim-selective certification: each response is decomposed into verifiable claims, scored against retrieved evidence, and mapped by an intent-aware selector to {full, partial, conflict, abstain}. On the primary weak-label certificate protocol, whose real-source-only dev/test rows cover the naturally occurring non-abstain actions, the full system records UCCR=0.0000, PAU=1.0000, PAU Precision=0.9901, and action accuracy=0.9204 on dev (n=314), and UCCR=0.0000, PAU=0.9967, PAU Precision=0.9739, and action accuracy=0.8997 on test (n=319). UCCR measures unsupported-claim risk within the certificate definition, and a source-missing counterfactual slice evaluates abstain under empty evidence. Shortcut controls quantify the action-label prior explained by source and intent metadata, while source/evidence-novel slices characterize transfer boundaries. The resulting interface separates action-label prediction from evidence-linked claim selection under mixed evidence.
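To make the claim-to-action interface concrete, the following is a minimal sketch of how per-claim verdicts might be collapsed into one of the four actions, and how action accuracy could be computed against labels. Everything here is an assumption for illustration: the verdict categories, the `Claim` type, the precedence rule in `select_action`, and the plain fraction-correct definition in `action_accuracy` are hypothetical and are not taken from the paper. In particular, the paper's selector is intent-aware, which this sketch does not model, and UCCR and PAU are omitted because the abstract does not define them.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Sequence


class Verdict(Enum):
    """Hypothetical per-claim outcomes after scoring against evidence."""
    SUPPORTED = "supported"
    CONDITIONAL = "conditional"    # holds only under stated conditions
    CONTRADICTED = "contradicted"
    UNSUPPORTED = "unsupported"    # no evidence either way


@dataclass
class Claim:
    text: str
    verdict: Verdict


def select_action(claims: List[Claim]) -> str:
    """Map claim verdicts to an action (assumed precedence, not the paper's rule)."""
    if not claims:
        return "abstain"           # nothing verifiable, e.g. empty evidence
    verdicts = {c.verdict for c in claims}
    if Verdict.CONTRADICTED in verdicts:
        return "conflict"          # any contradiction is surfaced first
    if verdicts == {Verdict.SUPPORTED}:
        return "full"              # every claim is supported
    if Verdict.SUPPORTED in verdicts or Verdict.CONDITIONAL in verdicts:
        return "partial"           # certify only the claims that hold
    return "abstain"               # only unsupported claims remain


def action_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of rows whose predicted action equals the labeled action."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


if __name__ == "__main__":
    mixed = [
        Claim("Drug A lowers blood pressure.", Verdict.SUPPORTED),
        Claim("Drug A is safe in pregnancy.", Verdict.CONTRADICTED),
    ]
    print(select_action(mixed))
```

Under this assumed precedence, a response with one supported and one contradicted claim is routed to conflict rather than partial, which mirrors the mixed-evidence case the abstract describes. A real selector would also need the intent signal and calibrated evidence scores rather than hard verdicts.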