CL · Mar 7, 2025

No Free Labels: Limitations of LLM-as-a-Judge Without Human Grounding

arXiv:2503.05061v1 · 47 citations · h-index: 9
Originality: Incremental advance
AI Analysis

This work addresses limitations of automated evaluation of AI-generated text, a concern for researchers and practitioners who rely on low-cost LLM-based assessments. The contribution is incremental: it builds on known judge biases and proposes a specific mitigation.

The paper examines biases in LLM-as-a-Judge frameworks when grading the correctness of conversational responses. It shows that LLM judges struggle on questions they cannot answer themselves, and it recommends supplying human-written reference answers to improve agreement with human annotators. In its experiments, a weaker judge given high-quality references outperforms a stronger judge given synthetic ones.

LLM-as-a-Judge is a framework that uses an LLM (large language model) to evaluate the quality of natural language text - typically text that is also generated by an LLM. This framework holds great promise due to its relatively low cost, ease of use, and strong correlations with human stylistic preferences. However, LLM Judges have been shown to exhibit biases that can distort their judgments. We evaluate how well LLM Judges can grade whether a given response to a conversational question is correct, an ability crucial to soundly estimating the overall response quality. To do so, we create and publicly release a human-annotated dataset with correctness labels for 1,200 LLM responses. We source questions from a combination of existing datasets and a novel, challenging benchmark (BFF-Bench) created for this analysis. We demonstrate a strong connection between an LLM's ability to correctly answer a question and its ability to grade responses to that question. Although aggregate-level statistics might imply that a judge has high agreement with human annotators, it will struggle on the subset of questions it could not answer. To address this issue, we recommend a simple solution: provide the judge with a correct, human-written reference answer. We perform an in-depth analysis of how reference quality can affect the performance of an LLM Judge. We show that providing a weaker judge (e.g., Qwen 2.5 7B) with higher-quality references reaches better agreement with human annotators than a stronger judge (e.g., GPT-4o) with synthetic references.
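The gap between aggregate agreement and agreement on questions the judge could not answer can be illustrated with a small sketch. This is not the paper's evaluation code: the `Item` record, its field names, and the toy data are hypothetical and chosen only to show the split, and plain agreement rate is used here as a stand-in for whatever agreement statistic the authors report.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One graded response (hypothetical schema, not the released dataset's)."""
    human_correct: bool       # human annotator's correctness label
    judge_correct: bool       # LLM judge's verdict on the same response
    judge_could_answer: bool  # did the judge model answer the question correctly itself?


def agreement(items):
    """Fraction of items where the judge matches the human label; None if empty."""
    if not items:
        return None
    return sum(i.human_correct == i.judge_correct for i in items) / len(items)


def agreement_by_subset(items):
    """Report agreement overall and split by whether the judge could answer."""
    answerable = [i for i in items if i.judge_could_answer]
    unanswerable = [i for i in items if not i.judge_could_answer]
    return {
        "overall": agreement(items),
        "judge_could_answer": agreement(answerable),
        "judge_could_not_answer": agreement(unanswerable),
    }


# Toy data for illustration only; these are not results from the paper.
items = [
    Item(True, True, True),
    Item(False, False, True),
    Item(True, True, True),
    Item(True, True, True),
    Item(False, False, True),
    Item(True, True, True),
    Item(False, True, False),  # judge disagrees with the human label
    Item(True, True, False),
]

if __name__ == "__main__":
    for name, value in agreement_by_subset(items).items():
        print(f"{name}: {value}")
```

On this toy data the overall agreement looks high while the agreement on the subset the judge could not answer is much lower, which is the pattern the abstract warns that aggregate statistics can hide.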

