CL · Feb 26, 2025

Judge as A Judge: Improving the Evaluation of Retrieval-Augmented Generation through the Judge-Consistency of Large Language Models

arXiv:2502.18817v1 · 7 citations · h-index: 31 · Has Code · ACL
Originality: Incremental advance
AI Analysis

This work addresses the challenge of fair evaluation for RAG models, which is crucial for researchers and developers in natural language processing, though it is incremental as it builds on existing LLM-based judgment methods.

The paper tackles the problem of inconsistent automated evaluation for Retrieval-Augmented Generation (RAG) models by introducing the Judge-Consistency (ConsJudge) method, which improves LLM-based judgments to provide more accurate evaluations, as shown through experiments across various models and datasets.

Retrieval-Augmented Generation (RAG) has proven its effectiveness in alleviating hallucinations for Large Language Models (LLMs). However, existing automated evaluation metrics cannot fairly evaluate the outputs generated by RAG models during training and evaluation. LLM-based judgment models have the potential to produce high-quality judgments, but they are highly sensitive to evaluation prompts, leading to inconsistencies when judging the output of RAG models. This paper introduces the Judge-Consistency (ConsJudge) method, which aims to enhance LLMs to generate more accurate evaluations for RAG models. Specifically, ConsJudge prompts LLMs to generate different judgments based on various combinations of judgment dimensions, uses judge-consistency to evaluate these judgments, and selects the accepted and rejected judgments for DPO training. Our experiments show that ConsJudge can effectively provide more accurate judgments for optimizing RAG models across various RAG models and datasets. Further analysis reveals that judgments generated by ConsJudge have high agreement with the superior LLM. All code is available at https://github.com/OpenBMB/ConsJudge.
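
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the dimension names, the token-overlap (Jaccard) similarity used as the consistency measure, and all function names are assumptions chosen for illustration, since the abstract does not specify them. The idea shown is only that judgments produced under different dimension combinations are scored by how much they agree with one another, and the most and least consistent ones form a preference pair for DPO training.

```python
from itertools import combinations

# Hypothetical judgment dimensions; the abstract does not list the real ones.
DIMENSIONS = ["hallucination", "completeness", "coherence"]


def dimension_combinations(dimensions):
    """Return every non-empty subset of the judgment dimensions."""
    return [
        combo
        for size in range(1, len(dimensions) + 1)
        for combo in combinations(dimensions, size)
    ]


def jaccard(a, b):
    """Token-overlap similarity (an assumed stand-in for the consistency measure)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def consistency_scores(judgments):
    """Score each judgment by its mean similarity to all other judgments."""
    scores = []
    for i, judgment in enumerate(judgments):
        sims = [jaccard(judgment, other)
                for k, other in enumerate(judgments) if k != i]
        scores.append(sum(sims) / len(sims) if sims else 0.0)
    return scores


def select_preference_pair(judgments):
    """Pick the most consistent judgment as 'chosen' and the least as 'rejected'."""
    scores = consistency_scores(judgments)
    indices = range(len(judgments))
    best = max(indices, key=scores.__getitem__)
    worst = min(indices, key=scores.__getitem__)
    return {"chosen": judgments[best], "rejected": judgments[worst]}
```

In this sketch, each entry of `dimension_combinations(DIMENSIONS)` would parameterize one evaluation prompt sent to the judge LLM, and the resulting judgment texts would be passed to `select_preference_pair`. The returned dictionary has the chosen/rejected shape that preference-optimization training data typically uses.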

Code Implementations: 1 repo