CL
Oct 31, 2025

Rating Roulette: Self-Inconsistency in LLM-As-A-Judge Frameworks

arXiv:2510.27106v124 citations
h-index: 33
EMNLP
Originality: Synthesis-oriented
AI Analysis

The paper highlights a critical inconsistency in widely adopted LLM-based evaluation methods. The contribution is incremental, as it builds on existing frameworks.

The paper tackles the problem of low intra-rater reliability in LLM-as-a-judge frameworks for evaluating natural language generation, showing that the scores a judge assigns vary from run to run, which undermines trust in those evaluations.

As Natural Language Generation (NLG) continues to be widely adopted, properly assessing it has become quite difficult. Lately, using large language models (LLMs) for evaluating these generations has gained traction, as they tend to align more closely with human preferences than conventional n-gram or embedding-based metrics. In our experiments, we show that LLM judges have low intra-rater reliability in their assigned scores across different runs. This variance makes their ratings inconsistent, almost arbitrary in the worst case, making it difficult to measure how good their judgments actually are. We quantify this inconsistency across different NLG tasks and benchmarks and see if judicious use of LLM judges can still be useful following proper guidelines.
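To make "intra-rater reliability across runs" concrete, here is a minimal sketch of one way such consistency could be quantified. It assumes Krippendorff's alpha with an interval metric and complete data (every item scored in every run). The abstract above does not say which statistic the paper uses, so this choice is an assumption, and the function names `sum_sq_diffs` and `interval_alpha` are hypothetical. The scores in the usage lines are toy values for illustration, not results from the paper.

```python
from itertools import permutations


def sum_sq_diffs(values):
    """Sum of squared differences over all ordered pairs of positions."""
    return sum((a - b) ** 2 for a, b in permutations(values, 2))


def interval_alpha(scores_by_item):
    """Krippendorff's alpha (interval metric) for complete data.

    scores_by_item: list of lists. Each inner list holds the scores the same
    judge gave to one item across repeated runs (same run count per item).
    This is an illustrative choice of statistic, not the paper's stated method.
    """
    runs = len(scores_by_item[0])
    if runs < 2 or any(len(item) != runs for item in scores_by_item):
        raise ValueError("need the same number (>= 2) of runs for every item")

    all_values = [v for item in scores_by_item for v in item]
    total = len(all_values)

    # Observed disagreement: spread of scores within each item across runs.
    observed = sum(sum_sq_diffs(item) for item in scores_by_item) / (total * (runs - 1))
    # Expected disagreement: spread of scores if runs were paired at random.
    expected = sum_sq_diffs(all_values) / (total * (total - 1))

    if expected == 0:
        raise ValueError("alpha is undefined when every score is identical")
    return 1.0 - observed / expected


# Toy data: three items, each scored in three repeated runs of one judge.
consistent = [[1, 1, 1], [3, 3, 3], [5, 5, 5]]   # same score on every run
erratic = [[1, 3, 5], [5, 1, 3], [3, 5, 1]]      # score changes on every run

alpha_consistent = interval_alpha(consistent)
alpha_erratic = interval_alpha(erratic)
```

Under this statistic, a value of 1 means the judge repeats its own scores exactly, while values near or below 0 mean the run-to-run variation is as large as (or larger than) what random pairing of scores would produce.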

