LG, CL, AP, ML
Nov 26, 2025

How to Correctly Report LLM-as-a-Judge Evaluations

arXiv:2511.21140v3
14 citations
Originality: Incremental advance
AI Analysis

This work addresses biased and unreliable automated evaluation, a problem for AI researchers and practitioners. It offers a statistically grounded correction method that is an incremental advance over existing approaches.

The paper tackles bias in LLM-as-a-judge evaluations with a plug-in framework that corrects the bias and provides statistically principled uncertainty quantification. It identifies parameter regimes in which the corrected LLM-based estimate is more reliable than human-only evaluation, and it shows that the framework remains unbiased under distribution shift between the test and calibration datasets.

Large language models (LLMs) are widely used as scalable evaluators of model responses in lieu of human annotators. However, imperfect sensitivity and specificity of the LLM judges induce bias in naive evaluation scores. We propose a simple plug-in framework that corrects this bias and enables statistically principled uncertainty quantification. Our framework constructs confidence intervals that account for uncertainty from both the test dataset and a human-labeled calibration dataset. Additionally, it uses an adaptive strategy to allocate calibration samples for tighter intervals. Importantly, we characterize parameter regimes defined by the true evaluation score and the LLM judge's sensitivity and specificity in which our LLM-based evaluation yields more reliable estimates than human-only evaluation. Moreover, we show that our framework remains unbiased under distribution shift between the test and calibration datasets, in contrast to existing approaches.
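To make the bias correction concrete, here is a minimal sketch of a standard misclassification correction for a binary judge. It illustrates the general idea and is not necessarily the paper's exact estimator or interval. It assumes the judge marks a truly correct response as correct with probability q1 (sensitivity) and a truly incorrect response as incorrect with probability q0 (specificity), so the naive judged-correct rate is p = theta*q1 + (1 - theta)*(1 - q0), where theta is the true score. The function names, the split of the calibration set into m0 human-labeled incorrect and m1 human-labeled correct items, and the delta-method normal interval are all assumptions made for this example.

```python
import math


def corrected_score(p_hat, q0_hat, q1_hat):
    """Invert p = theta*q1 + (1 - theta)*(1 - q0) for theta, clipped to [0, 1].

    p_hat:  fraction of test responses the LLM judge marked correct
    q0_hat: estimated specificity from the human-labeled calibration set
    q1_hat: estimated sensitivity from the human-labeled calibration set
    """
    denom = q0_hat + q1_hat - 1.0
    if denom <= 0.0:
        # The correction is only meaningful for a better-than-chance judge.
        raise ValueError("judge must beat chance: q0 + q1 must exceed 1")
    theta = (p_hat + q0_hat - 1.0) / denom
    return min(1.0, max(0.0, theta))


def corrected_interval(p_hat, n, q0_hat, m0, q1_hat, m1, z=1.96):
    """Approximate normal interval via the delta method (an assumption here).

    n:  number of test items scored by the judge
    m0: calibration items that humans labeled incorrect (used for q0_hat)
    m1: calibration items that humans labeled correct (used for q1_hat)
    """
    theta = corrected_score(p_hat, q0_hat, q1_hat)
    denom = q0_hat + q1_hat - 1.0
    # Variance combines test-set noise with both calibration-set noises.
    var = (
        p_hat * (1.0 - p_hat) / n
        + (1.0 - theta) ** 2 * q0_hat * (1.0 - q0_hat) / m0
        + theta ** 2 * q1_hat * (1.0 - q1_hat) / m1
    ) / denom ** 2
    half = z * math.sqrt(var)
    return max(0.0, theta - half), min(1.0, theta + half)
```

For example, with a true score of 0.7, sensitivity 0.9, and specificity 0.8, the naive judged-correct rate is 0.7*0.9 + 0.3*0.2 = 0.69, and `corrected_score(0.69, 0.8, 0.9)` recovers approximately 0.7. The interval widens as the calibration set shrinks, which is why uncertainty from both datasets has to be reported.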
