CL AI IR LG · Nov 8, 2024

Ev2R: Evaluating Evidence Retrieval in Automated Fact-Checking

Mubashara Akhtar, Michael Schlichtkrull, Andreas Vlachos

arXiv:2411.05375v29.112 citations · h-index: 16 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the limitations of current evidence evaluation metrics in automated fact-checking. The contribution is incremental, building on prior methods to improve the reliability of evidence evaluation for researchers and practitioners in the field.

The paper tackles the evaluation of evidence retrieval in automated fact-checking by introducing Ev2R, a method that combines reference-based evaluation with verdict-level proxy scoring to assess how well evidence aligns with gold references and how reliably it supports the verdict. The results show that Ev2R outperforms existing approaches in accuracy and robustness, achieving stronger correlation with human judgments and holding up better under adversarial perturbations.

Current automated fact-checking (AFC) approaches typically evaluate evidence either implicitly via the predicted verdicts or through exact matches with predefined closed knowledge sources, such as Wikipedia. However, these methods are limited due to their reliance on evaluation metrics originally designed for other purposes and constraints from closed knowledge sources. In this work, we introduce Ev2R, which combines the strengths of reference-based evaluation and verdict-level proxy scoring. Ev2R jointly assesses how well the evidence aligns with the gold references and how reliably it supports the verdict, addressing the shortcomings of prior methods. We evaluate Ev2R against three types of evidence evaluation approaches: reference-based, proxy-reference, and reference-less baselines. Assessments against human ratings and adversarial tests demonstrate that Ev2R consistently outperforms existing scoring approaches in accuracy and robustness. It achieves stronger correlation with human judgments and greater robustness to adversarial perturbations, establishing it as a reliable metric for evidence evaluation in AFC. Code is available at https://github.com/mubasharaak/fc-evidence-evaluation.
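The abstract reports "correlation with human judgments" without naming a statistic here. As a minimal sketch only, one common way to measure such agreement is Spearman's rank correlation between a metric's scores and human ratings of the same evidence. The function names and all numbers below are hypothetical illustrations, not the authors' code or data.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: human ratings of five retrieved-evidence sets and the
# scores two made-up metrics assigned to the same five sets.
human_ratings = [1, 2, 3, 4, 5]
metric_a_scores = [0.1, 0.3, 0.2, 0.8, 0.9]
metric_b_scores = [0.9, 0.1, 0.5, 0.4, 0.2]

rho_a = spearman(human_ratings, metric_a_scores)
rho_b = spearman(human_ratings, metric_b_scores)
print(f"metric A vs. humans: {rho_a:.2f}")
print(f"metric B vs. humans: {rho_b:.2f}")
```

Under this reading, a metric whose scores order the evidence sets more like the human raters do receives a higher coefficient, which is the sense in which one scorer can be said to correlate more strongly with human judgments than another.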
