CL | AI | Apr 14, 2025

C-FAITH: A Chinese Fine-Grained Benchmark for Automated Hallucination Evaluation

Peking U
arXiv:2504.10167v1 | 2 citations | h-index: 5 | CIKM
Originality: Incremental advance
AI Analysis

This work addresses the cost of manual hallucination evaluation for developers and researchers of Chinese LLMs. It is an incremental advance: it builds on existing benchmark concepts, adding automation and a focus on the Chinese language.

The authors tackle automated hallucination evaluation for Chinese large language models by introducing HaluAgent, an agentic framework that automatically constructs fine-grained QA datasets from knowledge documents. With it they build C-FAITH, a benchmark of 60,702 entries drawn from 1,399 documents, and use it to evaluate 16 mainstream LLMs.

Despite the rapid advancement of large language models, they remain highly susceptible to generating hallucinations, which significantly hinders their widespread application. Hallucination research requires dynamic and fine-grained evaluation. However, most existing hallucination benchmarks (especially in Chinese) rely on human annotations, making automatic and cost-effective hallucination evaluation challenging. To address this, we introduce HaluAgent, an agentic framework that automatically constructs fine-grained QA datasets from knowledge documents. Our experiments demonstrate that manually designed rules and prompt optimization can improve the quality of the generated data. Using HaluAgent, we construct C-FAITH, a Chinese QA hallucination benchmark of 60,702 entries created from 1,399 knowledge documents obtained by web scraping. We comprehensively evaluate 16 mainstream LLMs on C-FAITH, providing detailed experimental results and analysis.

