CL | Feb 26, 2025

Bi'an: A Bilingual Benchmark and Model for Hallucination Detection in Retrieval-Augmented Generation

Zhouyu Jiang, Mengshu Sun, Zhiqiang Zhang, Lei Liang

arXiv:2502.19209v16.72 citations | h-index: 9 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the lack of comprehensive benchmarks and domain-optimized judge models for RAG hallucination detection, an incremental advance for researchers and practitioners in natural language processing.

The paper tackles hallucination detection in Retrieval-Augmented Generation (RAG) by introducing Bi'an, a framework that pairs a bilingual benchmark dataset with lightweight judge models. The authors' 14B model outperforms baseline models with more than five times as many parameters and rivals state-of-the-art closed-source LLMs.

Retrieval-Augmented Generation (RAG) effectively reduces hallucinations in Large Language Models (LLMs) but can still produce inconsistent or unsupported content. Although LLM-as-a-Judge is widely used for RAG hallucination detection due to its implementation simplicity, it faces two main challenges: the absence of comprehensive evaluation benchmarks and the lack of domain-optimized judge models. To bridge these gaps, we introduce Bi'an, a novel framework featuring a bilingual benchmark dataset and lightweight judge models. The dataset supports rigorous evaluation across multiple RAG scenarios, while the judge models are fine-tuned from compact open-source LLMs. Extensive experimental evaluations on Bi'anBench show our 14B model outperforms baseline models with over five times larger parameter scales and rivals state-of-the-art closed-source LLMs. We will release our data and models soon at https://github.com/OpenSPG/KAG.
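For readers unfamiliar with the LLM-as-a-Judge setup the abstract refers to, the following is a minimal sketch of how such an evaluation loop might look. It is not the authors' implementation: the prompt wording, the SUPPORTED/UNSUPPORTED labels, the `RagExample` fields, and the `always_supported` stand-in judge are all hypothetical and chosen only for illustration. A real run would replace the stand-in with a call to a judge model.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class RagExample:
    """One benchmark item (hypothetical schema, not the Bi'anBench format)."""
    question: str
    context: str        # retrieved passage(s) the answer should be grounded in
    answer: str         # generated answer to be checked
    hallucinated: bool  # gold label: True if the answer is not supported


# Hypothetical prompt; the paper's actual judge prompt is not given in the abstract.
PROMPT_TEMPLATE = (
    "You are checking a RAG answer against its retrieved context.\n"
    "Question: {question}\n"
    "Context: {context}\n"
    "Answer: {answer}\n"
    "Reply with SUPPORTED or UNSUPPORTED."
)


def build_judge_prompt(ex: RagExample) -> str:
    """Fill the template with the fields of a single example."""
    return PROMPT_TEMPLATE.format(
        question=ex.question, context=ex.context, answer=ex.answer
    )


def parse_verdict(text: str) -> bool:
    """Return True if the judge flagged the answer as hallucinated."""
    return text.strip().upper().startswith("UNSUPPORTED")


def judge_accuracy(
    examples: Sequence[RagExample], judge: Callable[[str], str]
) -> float:
    """Fraction of examples where the judge's verdict matches the gold label."""
    if not examples:
        raise ValueError("examples must not be empty")
    correct = sum(
        parse_verdict(judge(build_judge_prompt(ex))) == ex.hallucinated
        for ex in examples
    )
    return correct / len(examples)


def always_supported(prompt: str) -> str:
    """Stand-in judge so the sketch runs without a model; it never flags anything."""
    return "SUPPORTED"


EXAMPLES = [
    RagExample("What is the capital of France?",
               "Paris is the capital of France.", "Paris", False),
    RagExample("What is the capital of France?",
               "Paris is the capital of France.", "Lyon", True),
]

if __name__ == "__main__":
    print(judge_accuracy(EXAMPLES, always_supported))  # 0.5
```

The trivial judge scores 0.5 on this balanced toy set, which shows why a benchmark needs both hallucinated and faithful examples: a judge that never flags anything should not look competitive.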

