AI · May 27

Relevant Is Not Warranted: Evidence-Force Calibration for Cited RAG

arXiv:2605.28044
Predicted impact: top 55% in AI · last 90 days
Originality: Incremental advance
AI Analysis

For researchers and practitioners evaluating cited RAG systems, this work highlights a critical calibration gap and provides a benchmark to measure it.

The paper identifies a failure mode in cited RAG, termed 'citation laundering', in which a topically relevant citation under-warrants an over-strong claim. It introduces FORCEBENCH, a stress test for evidence-force calibration, and finds that standard generic support prompting yields an aggregate monotonicity violation rate (MVR) of 47.2% across four model judges, while explicit warrant-strength prompting lowers MVR to 24.5%. Lower MVR is better, so both settings still leave room for improvement.

Cited RAG evaluation often treats visible sources as a grounding signal, but a real, topically relevant citation can still under-warrant the attached wording. We study this diagnostic failure as citation laundering: a related source is presented as warrant for an over-strong claim. We introduce FORCEBENCH, a contrastive stress test for evidence-force calibration. Each item holds a cited passage fixed and pairs an evidence-calibrated claim with a localized force-raised variant across five operational axes: relation, modality, scope, temporal validity, and numeric specificity. A calibrated evaluator should score the evidence-calibrated claim higher. Headline experiments use a fixed, locality-filtered 198-pair evaluation set. A citation-presence sanity check is uninformative by design; token and entity overlap still violate monotonicity on 32.8–36.4% of pairs. Across four reported model judges, standard generic support prompting is insufficient for this force-calibration stress test (aggregate MVR 47.2%), while explicit warrant-strength prompting lowers MVR to 24.5% but remains imperfect. We release the benchmark, prompts, outputs, and plug-in pipeline so citation evaluators can report monotonicity violation rate and force sensitivity alongside conventional support metrics.
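The monotonicity check described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the released pipeline: the `Pair` record, the `token_overlap` scorer, the `mvr` function, the `ties_violate` flag, and the two toy pairs are all hypothetical names and data invented here. The abstract only states that a calibrated evaluator should score the evidence-calibrated claim higher, so counting ties as violations is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Pair:
    """One contrastive item: a fixed passage and two competing claims."""
    passage: str
    calibrated: str    # claim whose force matches the passage
    force_raised: str  # localized variant that overstates the passage
    axis: str          # e.g. "modality", "scope", "numeric specificity"


def token_overlap(passage: str, claim: str) -> float:
    """Toy scorer: fraction of claim tokens that also occur in the passage."""
    passage_tokens = set(passage.lower().split())
    claim_tokens = set(claim.lower().split())
    if not claim_tokens:
        return 0.0
    return len(passage_tokens & claim_tokens) / len(claim_tokens)


def mvr(pairs: Iterable[Pair],
        score: Callable[[str, str], float],
        ties_violate: bool = True) -> float:
    """Fraction of pairs where the calibrated claim is not scored higher.

    ties_violate is an assumption of this sketch: with True, equal scores
    count as violations, because the scorer failed to prefer the
    calibrated claim.
    """
    items = list(pairs)
    if not items:
        raise ValueError("need at least one pair")
    violations = 0
    for item in items:
        s_cal = score(item.passage, item.calibrated)
        s_raised = score(item.passage, item.force_raised)
        if s_cal < s_raised or (ties_violate and s_cal == s_raised):
            violations += 1
    return violations / len(items)


# Two made-up pairs for illustration only (not taken from FORCEBENCH).
toy_pairs = [
    Pair(
        passage="the drug may reduce symptoms in some adults",
        calibrated="the drug may reduce symptoms in some adults",
        force_raised="the drug will reduce symptoms in all adults",
        axis="modality",
    ),
    Pair(
        passage="sales rose about 10% in 2020",
        calibrated="sales rose roughly 10% in 2020",
        force_raised="sales rose 10% in 2020",
        axis="numeric specificity",
    ),
]

print(mvr(toy_pairs, token_overlap))      # 0.5
print(mvr(toy_pairs, lambda p, c: 1.0))   # 1.0
```

On the second toy pair, dropping the hedge "about" makes the force-raised claim overlap the passage more completely than the calibrated paraphrase does, which is the kind of violation a lexical scorer can produce. A constant scorer, loosely analogous to a citation-presence check that sees the same citation on both claims, cannot separate the two claims at all, so its MVR is determined entirely by how ties are counted.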

