Multilingual and Cross-Lingual Citation Needed Detection on Wikipedia for Lower-Resource Languages
This work is significant for lower-resource Wikipedia communities by providing an accessible and effective solution for automated fact-checking, which is an incremental improvement over existing methods.
This paper addresses Citation Needed Detection (CND) on Wikipedia, a task in automated fact-checking, particularly for lower-resource languages. The authors introduce MCN, a multilingual CND corpus across 18 languages, and demonstrate that small decoder-based language models (SLMs) fine-tuned with an encoder-style objective significantly outperform prompted large language models (LLMs) across languages, even in cross-lingual settings with minimal target-language adaptation.
In automated fact-checking (AFC), check-worthiness detection identifies claims requiring verification based on domain-specific criteria. On Wikipedia, this task instantiates as Citation Needed Detection (CND), which flags claims lacking supporting citations. However, existing research has largely overlooked lower-resource languages, and recent AFC pipelines rely on large language models (LLMs), which are inaccessible to low-resource organizations. We introduce MCN, a multilingual CND corpus spanning 18 languages across three resource levels, on which we conduct an extensive study of small decoder-based language models (SLMs). Our experiments show that SLMs fine-tuned with an encoder-style objective substantially outperform prompted LLMs across languages. We further present one of the first studies on cross-lingual CND, demonstrating that SLMs fine-tuned solely on English claims surpass LLMs, even with little to no target-language adaptation. Our findings have important implications for lower-resource Wikipedia communities and suggest that compact, task-specific models are preferable to LLMs for CND. We release all data and code at https://github.com/gerritq/mcn