CL · Nov 26, 2025

Can LLMs extract human-like fine-grained evidence for evidence-based fact-checking?

Antonín Jarolím, Martin Fajčík, Lucia Makaiová

arXiv:2511.21401v1 · 1 citation · h-index: 1

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for evidence-based fact-checking in Czech and Slovak online comments, but appears incremental, as it evaluates existing LLMs on a new dataset.

The paper tackles fine-grained evidence extraction for Czech and Slovak claims in misinformation detection. It finds that LLMs often fail to copy evidence verbatim from source texts: llama3.1:8b achieves a high proportion of correct outputs despite its small size, while gpt-oss-120b underperforms despite having many more parameters.

Misinformation frequently spreads in user comments under online news articles, highlighting the need for effective methods to detect factually incorrect information. To strongly support or refute claims extracted from such comments, it is necessary to identify relevant documents and pinpoint the exact text spans that justify or contradict each claim. This paper focuses on the latter task -- fine-grained evidence extraction for Czech and Slovak claims. We create a new dataset containing two-way annotated fine-grained evidence produced by paid annotators. We evaluate large language models (LLMs) on this dataset to assess their alignment with human annotations. The results reveal that LLMs often fail to copy evidence verbatim from the source text, leading to invalid outputs. Error-rate analysis shows that the llama3.1:8b model achieves a high proportion of correct outputs despite its relatively small size, while the gpt-oss-120b model underperforms despite having many more parameters. Furthermore, the models qwen3:14b, deepseek-r1:32b, and gpt-oss:20b demonstrate an effective balance between model size and alignment with human annotations.
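The verbatim-copy failure described above can be illustrated with a minimal sketch. It assumes that an output counts as invalid when any extracted span is not an exact substring of the source text; the paper's actual validity criterion may differ. The helper names `is_verbatim` and `invalid_rate` and the toy Czech example are hypothetical and do not come from the paper or its dataset.

```python
from typing import List


def is_verbatim(span: str, source: str) -> bool:
    """Return True if a non-empty span occurs character-for-character in source."""
    return bool(span) and span in source


def invalid_rate(outputs: List[List[str]], source: str) -> float:
    """Fraction of model outputs that contain at least one non-verbatim span.

    Assumption: each output is a list of evidence spans, and a single
    non-verbatim span makes the whole output invalid. An output with no
    spans is treated as valid here.
    """
    if not outputs:
        return 0.0
    invalid = sum(
        1 for spans in outputs
        if not all(is_verbatim(span, source) for span in spans)
    )
    return invalid / len(outputs)


# Toy source text and model outputs, made up for illustration only.
source = "Praha je hlavní město České republiky. Má přibližně 1,3 milionu obyvatel."
outputs = [
    ["Praha je hlavní město České republiky."],  # copied verbatim
    ["Praha má 1,3 milionu obyvatel."],          # paraphrased, so not verbatim
]
rate = invalid_rate(outputs, source)
```

An exact substring test is deliberately strict: a paraphrase, or even a whitespace or punctuation difference, makes a span invalid under this assumption. A more lenient check would normalize both strings before comparing.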
