CL, Jun 1, 2021

Claim Matching Beyond English to Scale Global Fact-Checking

arXiv:2106.00853v1727 citations
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of scaling fact-checking to non-English contexts, though the advance is incremental because it builds on existing multilingual embedding methods.

The paper tackles the problem of scaling fact-checking globally by introducing claim matching, the task of identifying pairs of messages whose claims can be served by one fact-check, and by constructing a multilingual dataset that covers both high- and lower-resource languages. It reports that the authors' trained embedding model outperforms state-of-the-art models such as LASER and LaBSE in all settings.

Manual fact-checking does not scale well to serve the needs of the internet. This issue is further compounded in non-English contexts. In this paper, we discuss claim matching as a possible solution to scale fact-checking. We define claim matching as the task of identifying pairs of textual messages containing claims that can be served with one fact-check. We construct a novel dataset of WhatsApp tipline and public group messages alongside fact-checked claims that are first annotated for containing "claim-like statements" and then matched with potentially similar items and annotated for claim matching. Our dataset contains content in high-resource (English, Hindi) and lower-resource (Bengali, Malayalam, Tamil) languages. We train our own embedding model using knowledge distillation and a high-quality "teacher" model in order to address the imbalance in embedding quality between the low- and high-resource languages in our dataset. We provide evaluations on the performance of our solution and compare with baselines and existing state-of-the-art multilingual embedding models, namely LASER and LaBSE. We demonstrate that our performance exceeds LASER and LaBSE in all settings. We release our annotated datasets, codebooks, and trained embedding model to allow for further research.
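As a rough illustration of the claim-matching task defined in the abstract, the sketch below pairs incoming messages with already fact-checked claims by comparing embeddings with cosine similarity. It is not the authors' implementation: `embed` is a toy character-trigram stand-in for a multilingual sentence embedding model (such as the distilled model the paper describes), and the names `match_claims` and `threshold`, along with the 0.5 cutoff, are assumptions made for the example.

```python
import math
from collections import Counter


def embed(text, n=3):
    """Toy stand-in for a sentence embedding: character n-gram counts.

    Assumption: a real system would call a multilingual embedding model here.
    """
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(value * b[key] for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_claims(messages, fact_checked_claims, threshold=0.5):
    """Return (message, best_claim, score) for messages that clear the threshold.

    The threshold is a hypothetical value; in practice it would be tuned
    on annotated claim pairs.
    """
    claim_vectors = [(claim, embed(claim)) for claim in fact_checked_claims]
    matches = []
    for message in messages:
        message_vector = embed(message)
        best_claim, best_score = max(
            ((claim, cosine(message_vector, vec)) for claim, vec in claim_vectors),
            key=lambda pair: pair[1],
        )
        if best_score >= threshold:
            matches.append((message, best_claim, best_score))
    return matches


if __name__ == "__main__":
    claims = ["Drinking hot water cures the virus."]
    tips = ["drinking hot water cures the virus", "Good morning everyone"]
    for message, claim, score in match_claims(tips, claims):
        print(f"{message!r} -> {claim!r} (score={score:.2f})")
```

Character trigrams only capture surface overlap within one script, so this toy embedding cannot match claims across languages. That gap is the reason the paper relies on multilingual embeddings and distills a student model from a stronger teacher.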

