CL AI IR | Apr 21, 2025

Support Evaluation for the TREC 2024 RAG Track: Comparing Human versus LLM Judges

Nandan Thakur, Ronak Pradeep, Shivani Upadhyay, Daniel Campos, Nick Craswell, Jimmy Lin

arXiv:2504.15205v1 | 22 citations | h-index: 22

Originality: Incremental advance

AI Analysis

This work addresses reliable support assessment for RAG systems, which is crucial for reducing hallucinations. It is an incremental advance: it builds on existing evaluation methods by comparing human and LLM judges.

The study evaluates support in retrieval-augmented generation (RAG) systems by comparing human judges with an LLM judge (GPT-4o) on 45 submissions across 36 topics. Human and GPT-4o labels matched exactly (on a three-level scale) in 56% of from-scratch manual assessments and in 72% of assessments made by post-editing the LLM's predictions. In a follow-up analysis of the disagreements, an independent human judge agreed more closely with GPT-4o than with the original human judge.

Retrieval-augmented generation (RAG) enables large language models (LLMs) to generate answers with citations from source documents containing "ground truth", thereby reducing system hallucinations. A crucial factor in RAG evaluation is "support": whether the information in the cited documents supports the answer. To this end, we conducted a large-scale comparative study of 45 participant submissions on 36 topics to the TREC 2024 RAG Track, comparing an automatic LLM judge (GPT-4o) against human judges for support assessment. We considered two conditions: (1) fully manual assessments from scratch and (2) manual assessments with post-editing of LLM predictions. Our results indicate that for 56% of the manual from-scratch assessments, human and GPT-4o predictions match perfectly (on a three-level scale), increasing to 72% in the manual with post-editing condition. Furthermore, by carefully analyzing the disagreements in an unbiased study, we found that an independent human judge correlates better with GPT-4o than with a human judge, suggesting that LLM judges can be a reliable alternative for support assessment. To conclude, we provide a qualitative analysis of human and GPT-4o errors to help guide future iterations of support assessment.
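
As a rough illustration of the "match perfectly" figure above, the sketch below computes an exact-match agreement rate between two judges on a three-level scale. It is not the authors' evaluation code. The integer encoding (0, 1, 2), the function names `exact_match_rate` and `confusion_counts`, and the toy labels are all assumptions made for this example; the abstract does not name the three levels or give per-item data.

```python
from collections import Counter


def exact_match_rate(human, llm):
    """Fraction of items on which both judges assign the same label."""
    if len(human) != len(llm):
        raise ValueError("both judges must label the same items")
    if not human:
        raise ValueError("need at least one labeled item")
    matches = sum(h == m for h, m in zip(human, llm))
    return matches / len(human)


def confusion_counts(human, llm):
    """Count (human_label, llm_label) pairs to show where judges disagree."""
    return Counter(zip(human, llm))


# Toy labels on a hypothetical three-level scale (0 = lowest, 2 = highest).
# These values are invented for illustration and are not from the study.
human_labels = [2, 1, 0, 2, 2, 0, 1, 2]
llm_labels = [2, 2, 0, 2, 1, 0, 1, 0]

rate = exact_match_rate(human_labels, llm_labels)
pairs = confusion_counts(human_labels, llm_labels)

print(f"exact-match agreement: {rate:.1%}")
for (h, m), n in sorted(pairs.items()):
    print(f"human={h} llm={m}: {n}")
```

Exact match treats every disagreement the same, so a one-level gap counts as much as a two-level gap. The pair counts are included to make that distinction visible when inspecting disagreements.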
