CL, AI, CY | Feb 21, 2025

Beyond Translation: LLM-Based Data Generation for Multilingual Fact-Checking

arXiv:2502.15419v1 | 9 citations | h-index: 1 | Has Code
Originality: Synthesis-oriented
AI Analysis

This paper addresses the need for robust multilingual fact-checking systems to combat online misinformation, though its contribution is incremental: it extends existing dataset generation methods to new languages.

The paper tackles the shortage of multilingual fact-checking resources by introducing MultiSynFact, a large-scale dataset of 2.2M claim-source pairs covering Spanish, German, English, and low-resource languages. The data is generated by LLMs, grounded in external knowledge, and filtered through claim validation steps.

Robust automatic fact-checking systems have the potential to combat online misinformation at scale. However, most existing research primarily focuses on English. In this paper, we introduce MultiSynFact, the first large-scale multilingual fact-checking dataset containing 2.2M claim-source pairs designed to support Spanish, German, English, and other low-resource languages. Our dataset generation pipeline leverages Large Language Models (LLMs), integrating external knowledge from Wikipedia and incorporating rigorous claim validation steps to ensure data quality. We evaluate the effectiveness of MultiSynFact across multiple models and experimental settings. Additionally, we open-source a user-friendly framework to facilitate further research in multilingual fact-checking and dataset generation.

Code Implementations: 1 repo
