CL · May 29

Multilingual and Cross-Lingual Citation Needed Detection on Wikipedia for Lower-Resource Languages

Gerrit Quaremba, Amy Rechkemmer, Elizabeth Black, Denny Vrandečić, Elena Simperl

arXiv:2605.31136 · 27.7 · Has Code

Predicted impact: top 53% in CL · last 90 days
Originality: Incremental advance

AI Analysis

This work matters for lower-resource Wikipedia communities because it offers an accessible and effective approach to flagging claims that need citations. It is an incremental improvement over existing methods.

This paper addresses Citation Needed Detection (CND) on Wikipedia, a task in automated fact-checking, particularly for lower-resource languages. The authors introduce MCN, a multilingual CND corpus across 18 languages, and demonstrate that small decoder-based language models (SLMs) fine-tuned with an encoder-style objective significantly outperform prompted large language models (LLMs) across languages, even in cross-lingual settings with minimal target-language adaptation.

In automated fact-checking (AFC), check-worthiness detection identifies claims requiring verification based on domain-specific criteria. On Wikipedia, this task instantiates as Citation Needed Detection (CND), which flags claims lacking supporting citations. However, existing research has largely overlooked lower-resource languages, and recent AFC pipelines rely on large language models (LLMs), which are inaccessible to low-resource organizations. We introduce MCN, a multilingual CND corpus spanning 18 languages across three resource levels, on which we conduct an extensive study of small decoder-based language models (SLMs). Our experiments show that SLMs fine-tuned with an encoder-style objective substantially outperform prompted LLMs across languages. We further present one of the first studies on cross-lingual CND, demonstrating that SLMs fine-tuned solely on English claims surpass LLMs, even with little to no target-language adaptation. Our findings have important implications for lower-resource Wikipedia communities and suggest that compact, task-specific models are preferable to LLMs for CND. We release all data and code at https://github.com/gerritq/mcn
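The abstract does not spell out what an "encoder-style objective" looks like in practice. One plausible reading, sketched below purely as an illustration, is that the decoder's hidden state for the last non-padding token is fed to a small classification head trained with cross-entropy, instead of prompting the model to generate an answer. The function names, the last-token pooling choice, and the toy numbers are all assumptions and are not taken from the paper or its repository.

```python
import math

# Hypothetical sketch: a sequence-classification head on top of a decoder's
# hidden states. Nothing here is taken from the MCN code base.

def last_token_pool(hidden_states, attention_mask):
    """Return the hidden vector of the last non-padding token."""
    last = max(i for i, m in enumerate(attention_mask) if m == 1)
    return hidden_states[last]

def classify(pooled, weights, bias):
    """Linear head plus softmax over two labels.

    Label 0 = no citation needed, label 1 = citation needed (assumed mapping).
    """
    logits = [sum(w * x for w, x in zip(row, pooled)) + b
              for row, b in zip(weights, bias)]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Negative log-likelihood of the gold label."""
    return -math.log(probs[label])

# Toy input: four token positions with 2-dimensional hidden states;
# the final position is padding and must be ignored by the pooling step.
hidden_states = [[0.1, 0.2], [0.4, -0.3], [0.9, 0.5], [0.0, 0.0]]
attention_mask = [1, 1, 1, 0]

# Toy head parameters (in real fine-tuning these would be learned).
weights = [[-1.0, 0.0], [1.0, 0.0]]
bias = [0.0, 0.0]

pooled = last_token_pool(hidden_states, attention_mask)
probs = classify(pooled, weights, bias)
loss = cross_entropy(probs, label=1)
print(pooled, probs, loss)
```

In an actual fine-tuning run, the loss would be backpropagated through both the head and the language model. The sketch covers only the forward pass, to show how the classification objective differs from generating a label as text.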

