CL · Jun 26, 2024

Evaluating Quality of Answers for Retrieval-Augmented Generation: A Strong LLM Is All You Need

Yang Wang, Alberto Garcia Hernandez, Roman Kyslyi, Nicholas Kersting

arXiv:2406.18064v3 · 4.89 citations

Originality: Incremental advance

AI Analysis

This work addresses the need for efficient quality assessment in factual business contexts where human evaluations are resource-intensive. The advance is incremental because it builds on existing RAG and LLM evaluation methods.

The study tackles the evaluation of answer quality in Retrieval-Augmented Generation (RAG) applications by introducing vRAG-Eval, a grading system that assesses correctness, completeness, and honesty. It finds that GPT-4's evaluations agree with human expert judgments on 83% of accept/reject decisions.

We present a comprehensive study of answer quality evaluation in Retrieval-Augmented Generation (RAG) applications using vRAG-Eval, a novel grading system designed to assess correctness, completeness, and honesty. We further map the grades for these quality aspects into a binary score indicating an accept or reject decision, mirroring the intuitive "thumbs-up" or "thumbs-down" gesture commonly used in chat applications. This approach suits factual business contexts where a clear decision is essential. We apply vRAG-Eval with two Large Language Models (LLMs) as evaluators, grading the quality of answers generated by a vanilla RAG application. We compare these evaluations with human expert judgments and find substantial alignment between GPT-4's assessments and those of human experts, reaching 83% agreement on accept or reject decisions. This study highlights the potential of LLMs as reliable evaluators in closed-domain, closed-ended settings, particularly when human evaluations require significant resources.
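The mapping from a grade to an accept/reject decision, and the agreement figure computed over those decisions, can be illustrated with a minimal sketch. The abstract does not state the grading scale or the cutoff, so the 1-to-5 scale, the threshold of 4, and the function names below are all assumptions made for illustration, and the grades are toy data, not results from the paper.

```python
from typing import Sequence

# Hypothetical cutoff on an assumed 1-5 grading scale; the abstract does not
# specify the actual scale or threshold used by vRAG-Eval.
ACCEPT_THRESHOLD = 4


def to_binary(grade: int, threshold: int = ACCEPT_THRESHOLD) -> bool:
    """Map a numeric grade to accept (True) or reject (False)."""
    return grade >= threshold


def agreement_rate(
    llm_grades: Sequence[int],
    human_grades: Sequence[int],
    threshold: int = ACCEPT_THRESHOLD,
) -> float:
    """Fraction of answers on which both graders reach the same decision."""
    if len(llm_grades) != len(human_grades):
        raise ValueError("grade lists must have the same length")
    if not llm_grades:
        raise ValueError("grade lists must not be empty")
    matches = sum(
        to_binary(llm, threshold) == to_binary(human, threshold)
        for llm, human in zip(llm_grades, human_grades)
    )
    return matches / len(llm_grades)


# Toy grades, invented purely to exercise the functions.
llm_grades = [5, 4, 2, 3, 5, 1]
human_grades = [4, 5, 3, 4, 5, 2]
rate = agreement_rate(llm_grades, human_grades)
print(f"accept/reject agreement: {rate:.0%}")
```

Comparing binary decisions rather than raw grades means that a 4 from one grader and a 5 from the other count as agreement, which matches the "thumbs-up" or "thumbs-down" framing in the abstract.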
