CANDY: Benchmarking LLMs' Limitations and Assistive Potential in Chinese Misinformation Fact-Checking
This work addresses Chinese misinformation fact-checking, a challenge for AI researchers and practitioners. Its contribution is incremental, since it benchmarks the limitations of existing models rather than proposing a new solution.
The researchers evaluate how effectively large language models (LLMs) fact-check Chinese misinformation by building the CANDY benchmark, which includes a dataset of ~20k instances. They find that current LLMs are unreliable at generating accurate conclusions, even with chain-of-thought reasoning and few-shot prompting, but show potential as assistive tools that augment human performance.
The effectiveness of large language models (LLMs) at fact-checking misinformation remains uncertain, despite their growing use. To this end, we present CANDY, a benchmark designed to systematically evaluate the capabilities and limitations of LLMs in fact-checking Chinese misinformation. Specifically, we curate a carefully annotated dataset of ~20k instances. Our analysis shows that current LLMs exhibit limitations in generating accurate fact-checking conclusions, even when enhanced with chain-of-thought reasoning and few-shot prompting. To understand these limitations, we develop a taxonomy to categorize flawed LLM-generated explanations for their conclusions and identify factual fabrication as the most common failure mode. Although LLMs alone are unreliable for fact-checking, our findings indicate their considerable potential to augment human performance when deployed as assistive tools. Our dataset and code can be accessed at https://github.com/SCUNLP/CANDY