CL · Oct 27, 2023

Lost in Translation, Found in Spans: Identifying Claims in Multilingual Social Media

arXiv:2310.18205v1136 citations · h-index: 47
Originality: Synthesis-oriented
AI Analysis

This paper addresses a gap in the fact-checking pipelines used by journalists and human fact-checkers by extending claim identification to multilingual contexts. The contribution is incremental: it applies existing methods to new data.

The authors tackled the understudied problem of claim span identification in multilingual social media by creating the X-CLAIM dataset with 7K claims in five Indian languages and English, and found that encoder-only models like XLM-R outperform generative LLMs for low-resource languages.

Claim span identification (CSI) is an important step in fact-checking pipelines, aiming to identify text segments that contain a checkworthy claim or assertion in a social media post. Despite its importance to journalists and human fact-checkers, it remains a severely understudied problem, and the scarce research on this topic so far has only focused on English. Here we aim to bridge this gap by creating a novel dataset, X-CLAIM, consisting of 7K real-world claims collected from numerous social media platforms in five Indian languages and English. We report strong baselines with state-of-the-art encoder-only language models (e.g., XLM-R) and we demonstrate the benefits of training on multiple languages over alternative cross-lingual transfer methods such as zero-shot transfer, or training on translated data, from a high-resource language such as English. We evaluate generative large language models from the GPT series using prompting methods on the X-CLAIM dataset and we find that they underperform the smaller encoder-only language models for low-resource languages.
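To make the task concrete, here is a minimal sketch of how a claim span inside a post could be turned into per-token labels that an encoder-only model such as XLM-R might be trained on. The BIO labeling scheme, the whitespace tokenizer, the function names, and the example post are all assumptions made for illustration; the abstract does not state how X-CLAIM encodes spans.

```python
import re

def tokenize_with_offsets(text):
    """Whitespace tokenizer returning (token, start, end) character offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def spans_to_bio(text, claim_spans):
    """Map character-level claim spans [(start, end), ...] to B/I/O token labels.

    Assumption: a token belongs to a span if it overlaps it by at least one character.
    """
    tokens, labels = [], []
    prev_span = None  # index of the span the previous token belonged to
    for token, tok_start, tok_end in tokenize_with_offsets(text):
        current = next(
            (i for i, (s, e) in enumerate(claim_spans) if tok_start < e and tok_end > s),
            None,
        )
        if current is None:
            labels.append("O")      # outside any claim span
        elif current == prev_span:
            labels.append("I")      # continues the same span
        else:
            labels.append("B")      # first token of a span
        prev_span = current
        tokens.append(token)
    return tokens, labels

# Hypothetical post and claim, invented for this example.
post = "BREAKING: drinking hot water cures the flu, share now!"
claim = "drinking hot water cures the flu"
start = post.index(claim)
tokens, labels = spans_to_bio(post, [(start, start + len(claim))])
for token, label in zip(tokens, labels):
    print(f"{label}\t{token}")
```

A real pipeline would use the model's own subword tokenizer and align labels to subwords; the whitespace tokenizer here only keeps the sketch self-contained.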

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
