CL AI Sep 4, 2025

CANDY: Benchmarking LLMs' Limitations and Assistive Potential in Chinese Misinformation Fact-Checking

Ruiling Guo, Xinwei Yang, Chen Huang, Tong Zhang, Yong Hu

arXiv:2509.03957v1 1 citation h-index: 4 Has Code EMNLP

Originality: Synthesis-oriented

AI Analysis

This work addresses the challenge of misinformation fact-checking in Chinese for AI researchers and practitioners, but it is incremental as it benchmarks existing limitations rather than proposing a new solution.

The researchers evaluate how effectively large language models (LLMs) fact-check Chinese misinformation by creating the CANDY benchmark with a dataset of ~20k instances. They find that current LLMs are unreliable at generating accurate conclusions, even with advanced prompting techniques, but show potential as assistive tools that augment human performance.

The effectiveness of large language models (LLMs) in fact-checking misinformation remains uncertain, despite their growing use. To this end, we present CANDY, a benchmark designed to systematically evaluate the capabilities and limitations of LLMs in fact-checking Chinese misinformation. Specifically, we curate a carefully annotated dataset of ~20k instances. Our analysis shows that current LLMs exhibit limitations in generating accurate fact-checking conclusions, even when enhanced with chain-of-thought reasoning and few-shot prompting. To understand these limitations, we develop a taxonomy to categorize flawed LLM-generated explanations for their conclusions and identify factual fabrication as the most common failure mode. Although LLMs alone are unreliable for fact-checking, our findings indicate their considerable potential to augment human performance when deployed as assistive tools. Our dataset and code can be accessed at https://github.com/SCUNLP/CANDY
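As a rough illustration of how fact-checking conclusions and flawed explanations from a benchmark of this kind might be scored, the minimal sketch below computes accuracy and macro-F1 over verdict labels and tallies failure categories. It is not the CANDY evaluation code: the label strings, the function names, and the toy data are all assumptions made for the example, and the only category name taken from the abstract is "factual fabrication".

```python
from collections import Counter

def accuracy(gold, pred):
    """Fraction of predicted verdicts that match the gold verdicts."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and equal in length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over every label seen in gold or pred."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and equal in length")
    scores = []
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def failure_counts(tags):
    """Count how often each annotated failure category occurs."""
    return Counter(tags)

# Toy data only (hypothetical labels, not results from the paper).
gold = ["false", "false", "true", "false"]
pred = ["false", "true", "true", "false"]

# Hypothetical annotations of flawed explanations; "faulty reasoning" is a
# placeholder name, not necessarily a category in the paper's taxonomy.
tags = ["factual fabrication", "faulty reasoning", "factual fabrication"]

if __name__ == "__main__":
    print(f"accuracy = {accuracy(gold, pred):.3f}")
    print(f"macro-F1 = {macro_f1(gold, pred):.3f}")
    print(failure_counts(tags).most_common(1))
```

Macro-F1 is used here alongside accuracy because misinformation datasets are often label-imbalanced, which can make accuracy alone look optimistic; whether the paper reports these exact metrics is not stated in the abstract.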

