Case-Grounded Evidence Verification: A Framework for Constructing Evidence-Sensitive Supervision
The work addresses a bottleneck in evidence-grounded reasoning: the lack of supervision that encodes the causal role of evidence, particularly in domains like radiology. The contribution is incremental, since it builds on existing verification methods.
The paper tackles weak supervision and weak evaluation in evidence-grounded reasoning. It introduces a framework for constructing evidence-sensitive supervision that generates explicit support and non-support examples without manual evidence annotation. The trained verifier outperforms case-only and evidence-only baselines, collapses when evidence is removed or swapped (indicating genuine evidence dependence), and transfers to unseen evidence articles and an external case distribution, though performance degrades under evidence-source shift and remains sensitive to backbone choice.
Evidence-grounded reasoning requires more than attaching retrieved text to a prediction: a model should make decisions that depend on whether the provided evidence supports the target claim. In practice, this often fails because supervision is weak, evidence is only loosely tied to the claim, and evaluation does not test evidence dependence directly. We introduce case-grounded evidence verification, a general framework in which a model receives a local case context, external evidence, and a structured claim, and must decide whether the evidence supports the claim for that case. Our key contribution is a supervision construction procedure that generates explicit support examples together with semantically controlled non-support examples, including counterfactual wrong-state and topic-related negatives, without manual evidence annotation. We instantiate the framework in radiology and train a standard verifier on the resulting support task. The learned verifier substantially outperforms both case-only and evidence-only baselines, remains strong under correct evidence, and collapses when evidence is removed or swapped, indicating genuine evidence dependence. This behavior transfers across unseen evidence articles and an external case distribution, though performance degrades under evidence-source shift and remains sensitive to backbone choice. Overall, the results suggest that a major bottleneck in evidence grounding is not only model capacity, but the lack of supervision that encodes the causal role of evidence.
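The supervision construction described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Claim`, `Example`, `build_examples`, and `with_evidence`, the two-state claim vocabulary, the `related_findings` map, and the placeholder texts are all assumptions made for illustration. In particular, treating a topic-related negative as "evidence about a related but different finding" is one plausible reading of the abstract, not a confirmed definition.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Claim:
    finding: str  # hypothetical structured claim: a finding name ...
    state: str    # ... and its asserted state, e.g. "present" or "absent"


@dataclass(frozen=True)
class Example:
    case_context: str
    evidence: str
    claim: Claim
    label: int  # 1 = evidence supports the claim for this case, 0 = it does not
    kind: str   # "support", "wrong_state", or "topic_related"


def build_examples(case_context, true_claim, evidence_by_finding, states, related_findings):
    """Build one support example plus controlled non-support examples (assumed scheme)."""
    evidence = evidence_by_finding[true_claim.finding]
    examples = [Example(case_context, evidence, true_claim, 1, "support")]

    # Counterfactual wrong-state negatives: same case and evidence, flipped claim state.
    for state in states:
        if state != true_claim.state:
            wrong_claim = Claim(true_claim.finding, state)
            examples.append(Example(case_context, evidence, wrong_claim, 0, "wrong_state"))

    # Topic-related negatives (assumed reading): true claim, evidence about a related finding.
    for other in related_findings.get(true_claim.finding, []):
        if other in evidence_by_finding:
            examples.append(
                Example(case_context, evidence_by_finding[other], true_claim, 0, "topic_related")
            )
    return examples


def with_evidence(example, new_evidence):
    """Return a copy with the evidence removed ("") or swapped, for dependence checks."""
    return replace(example, evidence=new_evidence)


# Toy, placeholder inputs; none of these strings come from the paper.
examples = build_examples(
    case_context="placeholder report text for one case",
    true_claim=Claim("finding_a", "present"),
    evidence_by_finding={
        "finding_a": "placeholder article text about finding_a",
        "finding_b": "placeholder article text about finding_b",
    },
    states=("present", "absent"),
    related_findings={"finding_a": ["finding_b"]},
)
for ex in examples:
    print(ex.kind, ex.label, ex.claim)
```

Under these assumptions, no manual evidence annotation is needed: labels follow from how each example is constructed. The `with_evidence` helper mirrors the removed-evidence and swapped-evidence checks mentioned above, where a verifier that truly depends on evidence would be expected to lose accuracy on the perturbed inputs.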