A Large-Scale Study of Relevance Assessments with Large Language Models: An Initial Look
This work addresses the high cost of manual relevance assessments in information retrieval, showing researchers and practitioners that LLM-based automation is viable. The contribution is incremental in that it builds on existing tools and benchmarks.
The study evaluated four relevance assessment approaches in the TREC 2024 RAG Track: the fully manual process and three alternatives that use LLMs to different extents. System rankings induced by automatically generated UMBRELA judgments correlate highly with those induced by manual judgments in terms of nDCG@20, nDCG@100, and Recall@100 across 77 runs from 19 teams, suggesting that UMBRELA judgments can replace manual ones for measuring run-level effectiveness.
The application of large language models to provide relevance assessments presents exciting opportunities to advance information retrieval, natural language processing, and beyond, but to date many unknowns remain. This paper reports on the results of a large-scale evaluation (the TREC 2024 RAG Track) where four different relevance assessment approaches were deployed in situ: the "standard" fully manual process that NIST has implemented for decades and three different alternatives that take advantage of LLMs to different extents using the open-source UMBRELA tool. This setup allows us to correlate system rankings induced by the different approaches to characterize tradeoffs between cost and quality. We find that in terms of nDCG@20, nDCG@100, and Recall@100, system rankings induced by automatically generated relevance assessments from UMBRELA correlate highly with those induced by fully manual assessments across a diverse set of 77 runs from 19 teams. Our results suggest that automatically generated UMBRELA judgments can replace fully manual judgments to accurately capture run-level effectiveness. Surprisingly, we find that LLM assistance does not appear to increase correlation with fully manual assessments, suggesting that costs associated with human-in-the-loop processes do not bring obvious tangible benefits. Overall, human assessors appear to be stricter than UMBRELA in applying relevance criteria. Our work validates the use of LLMs in academic TREC-style evaluations and provides the foundation for future studies.
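To make the notion of correlating system rankings concrete, the following is a minimal sketch of the general technique, not the track's actual evaluation code. It scores a few runs with nDCG@k under two sets of relevance judgments and then compares the induced rankings with Kendall's tau. The choice of Kendall's tau (the tau-a variant), the linear gain in nDCG, the single-topic setup, and all run names, document ids, and grades are illustrative assumptions; none of them come from the text above.

```python
import math
from itertools import combinations


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains (linear gain, log2 discount)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_docs, qrels, k):
    """nDCG@k for one ranked list; documents without a judgment count as grade 0."""
    gains = [qrels.get(doc, 0) for doc in ranked_docs]
    ideal = sorted(qrels.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs, no tie correction."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs


# Hypothetical toy data: three runs for one topic, judged two ways.
runs = {
    "run_a": ["d1", "d2", "d3"],
    "run_b": ["d3", "d1", "d2"],
    "run_c": ["d4", "d3", "d2"],
}
manual_qrels = {"d1": 3, "d2": 1, "d3": 0, "d4": 0}  # stricter grades (assumed)
auto_qrels = {"d1": 3, "d2": 2, "d3": 1, "d4": 0}    # more lenient grades (assumed)

run_names = sorted(runs)
manual_scores = [ndcg_at_k(runs[r], manual_qrels, k=3) for r in run_names]
auto_scores = [ndcg_at_k(runs[r], auto_qrels, k=3) for r in run_names]

# Tau near 1 means both judgment sets order the runs the same way.
tau = kendall_tau(manual_scores, auto_scores)
print(f"Kendall's tau between induced rankings: {tau:.2f}")
```

In this toy example the two judgment sets assign different absolute scores to the runs but order them identically, which is the sense in which one set of judgments can stand in for another when the goal is to rank systems. A real evaluation would average scores over many topics before comparing rankings.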