The LLM Effect on IR Benchmarks: A Meta-Analysis of Effectiveness, Baselines, and Contamination
This meta-analysis asks how much LLMs have actually improved results on established IR benchmarks, and whether benchmark contamination could be distorting the apparent progress.
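The study analyzes 143 publications reporting results on TREC Robust04 and the TREC Deep Learning 2020 (DL20) passage retrieval benchmark to trace trends in retrieval effectiveness and baseline strength. Recent systems with LLM components achieve 8.8% higher nDCG@10 on DL20 than the best result from TREC 2020 and about 20% higher on Robust04 since 2023. A contamination detection approach adapted to reranking finds measurable contamination in both benchmarks, but confidence intervals are too wide to tell whether the gains reflect genuine methodological advances or memorization from pretraining data.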
Benchmark collections have long enabled controlled comparison and cumulative progress in Information Retrieval (IR). However, prior meta-analyses have shown that reported effectiveness gains often fail to accumulate, in part due to the use of weak or outdated baselines. While large language models are increasingly used in retrieval pipelines, their impact on established IR benchmarks has not been systematically analyzed. In this study, we analyze 143 publications reporting results on the TREC Robust04 collection and the TREC Deep Learning 2020 (DL20) passage retrieval benchmark to examine longitudinal trends in retrieval effectiveness and baseline strength. We observe what we term an \emph{LLM effect}: recent systems incorporating LLM components achieve 8.8\% higher nDCG@10 on DL20 compared to the best result from TREC 2020 and approximately 20\% higher on Robust04 since 2023. However, adapting a data contamination detection approach to reranking reveals measurable contamination in both benchmarks. While excluding contaminated topics reduces effectiveness, confidence intervals remain wide, making it difficult to determine whether the LLM effect reflects genuine methodological advances or memorization from pretraining data.
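The quantities named in the abstract can be made concrete with a minimal Python sketch. It is not the authors' code, and all function names are hypothetical. It assumes the linear-gain form of nDCG@10 and a percentile bootstrap over per-topic scores; the abstract does not say which nDCG variant or which confidence-interval procedure the study used.

```python
import math
import random


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the top-k graded relevance labels.

    Assumption: linear gain (label / log2(rank + 1)); other formulations
    use an exponential gain (2**label - 1) instead.
    """
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_gains, judged_gains, k=10):
    """nDCG@k: DCG of the system ranking divided by the ideal DCG.

    ranked_gains: relevance labels in the order the system returned them.
    judged_gains: labels of all judged documents for the topic.
    """
    ideal = dcg_at_k(sorted(judged_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0


def relative_gain(new_score, reference_score):
    """Relative improvement in percent, e.g. for 'X% higher nDCG@10'."""
    return 100.0 * (new_score - reference_score) / reference_score


def bootstrap_ci(topic_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean per-topic score.

    Topics are resampled with replacement; the interval spans the
    alpha/2 and 1 - alpha/2 quantiles of the resampled means.
    """
    rng = random.Random(seed)
    n = len(topic_scores)
    means = sorted(
        sum(rng.choices(topic_scores, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return means[lo_idx], means[hi_idx]


def mean_without(topic_scores, excluded_topics):
    """Mean score and bootstrap interval after dropping flagged topics.

    topic_scores: dict mapping topic id to its nDCG@10.
    excluded_topics: ids flagged as contaminated (how topics are flagged
    is outside this sketch).
    """
    kept = [s for t, s in topic_scores.items() if t not in excluded_topics]
    return sum(kept) / len(kept), bootstrap_ci(kept)
```

Calling `mean_without(scores, flagged)` on per-topic scores gives the mean and its interval after flagged topics are dropped. Dropping topics leaves fewer scores to resample, which tends to widen the interval; this matches the abstract's caveat that the post-exclusion estimates are hard to interpret.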