Facts are Harder Than Opinions -- A Multilingual, Comparative Analysis of LLM-Based Fact-Checking Reliability
This work addresses scalable misinformation detection across multiple languages and diverse topics, but its contribution is incremental because it builds on existing datasets and benchmarks.
The paper tackles automated fact-checking by evaluating five LLMs on a multilingual dataset of 61,514 claims. It finds that GPT-4o has the highest accuracy but declines to classify 43% of claims, and that factual-sounding claims are misclassified more often than opinions.
The proliferation of misinformation necessitates scalable, automated fact-checking solutions. Yet current benchmarks often overlook multilingual and topical diversity. This paper introduces a novel, dynamically extensible dataset of 61,514 claims spanning multiple languages and topics, extending existing datasets up to 2024. Through a comprehensive evaluation of five prominent Large Language Models (LLMs), including GPT-4o, GPT-3.5 Turbo, LLaMA 3.1, and Mixtral 8x7B, we identify significant performance gaps across languages and topics. While GPT-4o achieves the highest overall accuracy, it declines to classify 43% of claims. Across all models, factual-sounding claims are misclassified more often than opinions, revealing a key vulnerability. These findings underscore the need for caution and highlight challenges in deploying LLM-based fact-checking systems at scale.
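The abstract reports both an accuracy figure and a rate of declined classifications, and the two interact: accuracy looks different depending on whether declined claims are excluded or counted as errors. The sketch below shows one way the two quantities could be computed side by side. It is an illustration only; the abstract does not state which accuracy definition the paper uses, and the `REFUSED` label, the `summarize` function, and the toy verdict labels are all hypothetical.

```python
# Hypothetical marker for a claim the model declined to classify.
REFUSED = "refused"


def summarize(predictions, gold):
    """Return refusal rate and two accuracy variants for one model.

    predictions: list of predicted labels, with REFUSED for declined claims
    gold: list of reference labels, same length as predictions
    """
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")

    total = len(gold)
    # Keep only the claims the model actually classified.
    answered = [(p, g) for p, g in zip(predictions, gold) if p != REFUSED]
    correct = sum(1 for p, g in answered if p == g)

    return {
        # Share of claims the model declined to classify.
        "refusal_rate": (total - len(answered)) / total if total else 0.0,
        # Accuracy over classified claims only (refusals excluded).
        "accuracy_answered": correct / len(answered) if answered else 0.0,
        # Accuracy over all claims (refusals counted as incorrect).
        "accuracy_overall": correct / total if total else 0.0,
    }


if __name__ == "__main__":
    # Toy data, not taken from the paper.
    preds = ["true", REFUSED, "false", "true"]
    gold = ["true", "false", "false", "false"]
    print(summarize(preds, gold))
```

Reporting both accuracy variants together with the refusal rate makes it harder for a model that declines many claims to appear more reliable than it is in practice.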