Relevant Is Not Warranted: Evidence-Force Calibration for Cited RAG
For researchers and practitioners evaluating cited RAG systems, this work highlights a critical calibration gap and provides a benchmark to measure it.
The paper identifies a failure mode in cited RAG, termed 'citation laundering', in which a topically relevant citation is attached to a claim stronger than the source warrants. It introduces FORCEBENCH, a stress test for evidence-force calibration, and finds that standard generic support prompting yields an aggregate monotonicity violation rate (MVR) of 47.2% across four model judges, while explicit warrant-strength prompting reduces MVR to 24.5%.
Cited RAG evaluation often treats visible sources as a grounding signal, but a real, topically relevant citation can still under-warrant the attached wording. We study this diagnostic failure as citation laundering: a related source is presented as warrant for an over-strong claim. We introduce FORCEBENCH, a contrastive stress test for evidence-force calibration. Each item holds a cited passage fixed and pairs an evidence-calibrated claim with a localized force-raised variant across five operational axes: relation, modality, scope, temporal validity, and numeric specificity. A calibrated evaluator should score the evidence-calibrated claim higher. Headline experiments use a fixed, locality-filtered 198-pair evaluation set. A citation-presence sanity check is uninformative by design; token and entity overlap still violate monotonicity on 32.8% to 36.4% of pairs. Across four reported model judges, standard generic support prompting is insufficient for this force-calibration stress test (aggregate MVR 47.2%), while explicit warrant-strength prompting lowers MVR to 24.5% but remains imperfect. We release the benchmark, prompts, outputs, and plug-in pipeline so citation evaluators can report monotonicity violation rate and force sensitivity alongside conventional support metrics.
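As a rough illustration of the monotonicity check described above, the following Python sketch computes a violation rate over contrastive pairs. It is not the released pipeline: the `Pair` type, the `token_overlap` scorer, the `monotonicity_violation_rate` function, and the choice to count ties as violations are all assumptions made for this example, and the two toy pairs are invented rather than drawn from FORCEBENCH.

```python
from typing import Callable, Iterable, NamedTuple


class Pair(NamedTuple):
    """Hypothetical container for one contrastive item."""
    passage: str      # cited passage, held fixed across both claims
    calibrated: str   # evidence-calibrated claim
    raised: str       # localized force-raised variant


def token_overlap(passage: str, claim: str) -> float:
    """Toy scorer: fraction of claim tokens that also occur in the passage."""
    passage_tokens = set(passage.lower().split())
    claim_tokens = set(claim.lower().split())
    if not claim_tokens:
        return 0.0
    return len(passage_tokens & claim_tokens) / len(claim_tokens)


def monotonicity_violation_rate(
    pairs: Iterable[Pair],
    scorer: Callable[[str, str], float],
    ties_violate: bool = True,
) -> float:
    """Share of pairs where the calibrated claim is not scored higher.

    Assumption: a tie counts as a violation unless ties_violate is False.
    """
    items = list(pairs)
    if not items:
        raise ValueError("need at least one pair")
    violations = 0
    for item in items:
        s_calibrated = scorer(item.passage, item.calibrated)
        s_raised = scorer(item.passage, item.raised)
        if s_calibrated < s_raised or (ties_violate and s_calibrated == s_raised):
            violations += 1
    return violations / len(items)


# Invented toy pairs, for illustration only.
toy_pairs = [
    Pair(
        passage="the drug may reduce symptoms in some adults",
        calibrated="the drug may reduce symptoms in some adults",
        raised="the drug cures symptoms in all adults",
    ),
    Pair(
        passage="sales rose in 2020",
        calibrated="sales increased during 2020",
        raised="sales rose in 2020 and 2021",
    ),
]

overlap_mvr = monotonicity_violation_rate(toy_pairs, token_overlap)
# A scorer that only checks that a citation exists gives both claims the same score.
presence_mvr = monotonicity_violation_rate(toy_pairs, lambda passage, claim: 1.0)
```

In the second toy pair the force-raised claim reuses more of the passage wording than the calibrated paraphrase does, so the overlap scorer prefers it. A constant citation-presence scorer ties on every pair, so its result depends entirely on how ties are treated, which is one way to read why such a check carries no information about evidence force.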