IR · Apr 27

When LLM Judges Inflate Scores: Exploring Overrating in Relevance Assessment

arXiv:2602.17170 · 52.5 · h-index: 9
AI Analysis

For IR evaluation researchers, this work identifies a critical bias in LLM-based relevance assessment that undermines the reliability of LLMs as proxies for human judges.

LLMs consistently assign inflated relevance scores to passages that do not genuinely satisfy the information need, revealing a system-wide bias rather than random fluctuations. The overrating is sensitive to passage length and surface-level lexical cues, raising concerns about using LLMs as replacements for human relevance assessors.

Human relevance assessment is time-consuming and cognitively intensive, limiting the scalability of Information Retrieval evaluation. This has led to growing interest in using large language models (LLMs) as proxies for human judges. However, it remains an open question whether LLM-based relevance judgments are reliable, stable, and rigorous enough to match humans for relevance assessment. In this work, we conduct a study of overrating behavior in LLM-based relevance judgments across model backbones, evaluation paradigms (pointwise and pairwise), and passage modification strategies. We show that models consistently assign inflated relevance scores, often with high confidence, to passages that do not genuinely satisfy the underlying information need, revealing a system-wide bias rather than random fluctuations in judgment. Furthermore, controlled experiments show that LLM-based relevance judgments can be highly sensitive to passage length and surface-level lexical cues. These results raise concerns about the usage of LLMs as drop-in replacements for human relevance assessors, and highlight the urgent need for careful diagnostic evaluation frameworks when applying LLMs for relevance assessments. Our code and results are publicly available.
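To make the distinction between a directional bias and random fluctuation concrete, here is a minimal sketch of one way overrating could be quantified in the pointwise setting, given paired human and LLM labels on the same graded scale. It is not the paper's metric or code: the function name `overrating_stats`, the `OverratingStats` fields, and the example labels are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class OverratingStats:
    overrate_fraction: float   # share of passages where the LLM score exceeds the human score
    underrate_fraction: float  # share of passages where the LLM score is below the human score
    mean_signed_diff: float    # mean of (llm - human); positive means inflation on average


def overrating_stats(human: Sequence[int], llm: Sequence[int]) -> OverratingStats:
    """Compare paired pointwise labels (hypothetical helper, not from the paper)."""
    if len(human) != len(llm):
        raise ValueError("human and llm must have the same length")
    if not human:
        raise ValueError("need at least one judged passage")
    n = len(human)
    # Signed difference per passage: positive when the LLM rates higher than the human.
    diffs = [l - h for h, l in zip(human, llm)]
    return OverratingStats(
        overrate_fraction=sum(d > 0 for d in diffs) / n,
        underrate_fraction=sum(d < 0 for d in diffs) / n,
        mean_signed_diff=sum(diffs) / n,
    )


# Made-up labels on an assumed 0-3 graded scale, for illustration only.
human_labels = [0, 1, 0, 2, 3, 0]
llm_labels = [2, 1, 1, 3, 3, 0]
stats = overrating_stats(human_labels, llm_labels)
print(stats)
```

Under this framing, disagreement that is roughly balanced between overrating and underrating would look like noise, whereas an overrate fraction well above the underrate fraction, together with a positive mean signed difference, would point to a directional bias of the kind the abstract describes.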
