RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework
RAGEval addresses two obstacles to RAG evaluation, particularly in domain-specific applications: the high cost of constructing data and the lack of suitable metrics. It is presented as a novel approach rather than an incremental improvement.
The paper tackles the challenge of evaluating Retrieval-Augmented Generation (RAG) systems in specialized scenarios by introducing RAGEval, a framework that generates high-quality evaluation datasets and proposes three novel metrics. In experiments, samples generated by RAGEval outperform those from zero-shot and one-shot methods in clarity, safety, conformity, and richness.
Retrieval-Augmented Generation (RAG) is a powerful approach that enables large language models (LLMs) to incorporate external knowledge. However, evaluating the effectiveness of RAG systems in specialized scenarios remains challenging due to the high costs of data construction and the lack of suitable evaluation metrics. This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios by generating high-quality documents, questions, answers, and references through a schema-based pipeline. With a focus on factual accuracy, we propose three novel metrics, Completeness, Hallucination, and Irrelevance, to rigorously evaluate LLM-generated responses. Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples. Furthermore, the use of LLMs for scoring the proposed metrics demonstrates a high level of consistency with human evaluations. RAGEval establishes a new paradigm for evaluating RAG systems in real-world applications. The code and dataset are released at https://github.com/OpenBMB/RAGEval.
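The abstract names the three metrics without defining them. As a rough illustration only, the sketch below assumes a key-point formulation: the reference answer is split into key points, and a judge (an LLM or a human) labels each one as covered by the response, contradicted by it, or missing from it. The label names and the function `keypoint_scores` are hypothetical and are not taken from the RAGEval code base.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical label set (an assumption, not RAGEval's actual schema):
#   "covered"      -> the response states the key point correctly
#   "contradicted" -> the response conflicts with the key point
#   "missing"      -> the response neither states nor contradicts it
LABELS = ("covered", "contradicted", "missing")


def keypoint_scores(labels: List[str]) -> Dict[str, float]:
    """Turn per-key-point judge labels into three fractions that sum to 1."""
    if not labels:
        raise ValueError("at least one key point is required")
    unknown = set(labels) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")

    counts = Counter(labels)
    total = len(labels)
    return {
        "completeness": counts["covered"] / total,
        "hallucination": counts["contradicted"] / total,
        "irrelevance": counts["missing"] / total,
    }


# Example: four key points, as labeled by a judge.
scores = keypoint_scores(["covered", "covered", "contradicted", "missing"])
print(scores)
```

Under this assumed formulation the three scores partition the key points, so a higher Completeness necessarily leaves less room for Hallucination and Irrelevance.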