AI · Jan 8

Evaluative Fingerprints: Stable and Systematic Differences in LLM Evaluator Behavior

arXiv:2601.05114v14.41 citations
Originality: Highly original
AI Analysis

The paper shows that LLM judges are not interchangeable instruments for scalable evaluation: each encodes its own distinct theory of quality, which is a problem for researchers and practitioners who rely on such systems.

The study found that LLM-as-judge systems exhibit near-zero inter-judge agreement (Krippendorff's α = 0.042), and that on two dimensions judges disagree more than random noise would predict (α < 0). Their disagreement patterns are nonetheless stable and systematic: a classifier identifies which judge produced an evaluation with 77.1% accuracy from rubric scores alone, and distinguishes GPT-4.1 from GPT-5.2 with 99.6% accuracy.
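To make the agreement statistic concrete, here is a minimal sketch of Krippendorff's α for complete data. It assumes an interval (squared-difference) distance, which may not be the metric the paper uses. The function name and the toy scores are illustrative, not taken from the study. α is 1 minus the ratio of observed within-item disagreement to the disagreement expected if scores were paired at random.

```python
from itertools import permutations


def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with an interval metric (assumed, see lead-in).

    units: list of lists; each inner list holds the scores that all
    judges gave to one item. Items with fewer than two scores are skipped.
    """
    units = [u for u in units if len(u) >= 2]
    values = [v for u in units for v in u]
    n = len(values)

    # Observed disagreement: squared differences between scores of the same item.
    d_o = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n

    # Expected disagreement: squared differences between all pooled scores.
    d_e = sum((a - b) ** 2 for a, b in permutations(values, 2)) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("all scores are identical; alpha is undefined")
    return 1.0 - d_o / d_e


# Toy data (hypothetical): three judges who always agree.
alpha_agree = krippendorff_alpha_interval([[1, 1, 1], [3, 3, 3], [5, 5, 5]])

# Toy data (hypothetical): two judges who take opposite views on every item.
alpha_oppose = krippendorff_alpha_interval([[1, 5], [5, 1], [1, 5]])

print(alpha_agree, alpha_oppose)
```

In the second toy case the scores within each item differ more than randomly paired scores would, so α comes out negative. This is the situation the summary describes as disagreeing more than random noise would predict.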

LLM-as-judge systems promise scalable, consistent evaluation. We find the opposite: judges are consistent, but not with each other; they are consistent with themselves. Across 3,240 evaluations (9 judges × 120 unique video×pack items × 3 independent runs), inter-judge agreement is near-zero (Krippendorff's α = 0.042). On two dimensions, judges disagree more than random noise would predict (α < 0). Yet this disagreement isn't chaos; it's structured. A classifier identifies which judge produced an evaluation with 77.1% accuracy from rubric scores alone, rising to 89.9% with disposition features. Within model families, the signal is even stronger: GPT-4.1 and GPT-5.2 are distinguishable with 99.6% accuracy. We call this the reliability paradox: judges cannot agree on what constitutes quality, yet their disagreement patterns are so stable they function as fingerprints. Each judge implements a distinct, stable theory of quality: an "evaluative disposition" that shapes how it interprets any rubric. We characterize these dispositions along multiple axes: harshness/leniency, dimension emphasis, within-judge stability (ICC), and evidence behavior (receipt validity, semantic linkage via NLI, and shotgun index). The implication is stark: LLM judges are not interchangeable instruments measuring a shared construct. They are distinct measurement devices, each encoding its own implicit theory of quality. Averaging their scores produces a synthetic verdict that corresponds to no judge's actual values.
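The fingerprinting idea can be illustrated with a small simulation. The abstract does not say which classifier the authors used, so this sketch substitutes a nearest-centroid rule. The judge names, rubric profiles, and noise level are all hypothetical. It shows only that stable per-judge scoring profiles are enough to recover the judge from rubric scores alone.

```python
import random

# Hypothetical judge profiles: mean score per rubric dimension (not from the paper).
PROFILES = {
    "judge_a": [2.0, 4.0, 3.0, 5.0],
    "judge_b": [4.0, 2.0, 5.0, 3.0],
    "judge_c": [3.0, 3.0, 3.0, 3.0],
}


def simulate(rng, n_per_judge, noise=0.3):
    """Draw (scores, judge) pairs: each judge's stable profile plus Gaussian noise."""
    rows = []
    for judge, profile in PROFILES.items():
        for _ in range(n_per_judge):
            rows.append(([m + rng.gauss(0.0, noise) for m in profile], judge))
    return rows


def fit_centroids(rows):
    """Average the score vectors observed for each judge."""
    sums, counts = {}, {}
    for scores, judge in rows:
        acc = sums.setdefault(judge, [0.0] * len(scores))
        for i, s in enumerate(scores):
            acc[i] += s
        counts[judge] = counts.get(judge, 0) + 1
    return {j: [x / counts[j] for x in acc] for j, acc in sums.items()}


def predict(centroids, scores):
    """Return the judge whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda j: sum((s - c) ** 2 for s, c in zip(scores, centroids[j])),
    )


rng = random.Random(0)  # fixed seed so the sketch is repeatable
train, test = simulate(rng, 50), simulate(rng, 50)
centroids = fit_centroids(train)
accuracy = sum(predict(centroids, s) == j for s, j in test) / len(test)
print(f"judge identification accuracy: {accuracy:.3f}")
```

Because the simulated profiles are far apart relative to the noise, identification here is close to perfect. Real rubric scores overlap more, which is consistent with the lower 77.1% figure reported for rubric scores alone.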

