CL · AI · May 27

Let the Results Speak: A Replication-First Paradigm for LLM Behavioral Benchmarking

Yuming Huang, Yao Liu, Lei Wang, Junchen Wan

arXiv:2605.27914 · 83.5

Predicted impact: top 58% in CL · last 90 days · Originality: Highly original

AI Analysis

For researchers and practitioners evaluating LLM behavioral qualities, this paradigm provides a more reliable and diagnostic benchmarking method that overcomes limitations of human rater consensus and LLM-as-judge circularity.

The paper addresses subjective evaluation of LLM behavior (e.g., empathy, restraint), where human agreement is low (rho ~ 0.45) and LLM-as-judge risks circularity. The authors propose a replication-first paradigm that certifies the instrument via four orthogonal properties and test it on emotional accompaniment, reaching ordinal Krippendorff alpha = 0.91 and detecting performance drops (e.g., gpt-5 falls 1.87 points from gpt-4.1 on advice-restraint) that aggregate scores hide.

Subjective evaluation of LLM behavior -- empathy, restraint, calibrated emotional tone -- is hard. Human inter-rater agreement on such qualities saturates near rho ~ 0.45, and an LLM-as-judge proxy alone risks circularity: a judge sharing the target's training cohort cannot independently verify it. Anchoring validity to a single human-rater consensus does not extend to capabilities where humans themselves disagree. We propose a replication-first paradigm: instead of anchoring on one rater group, we certify the instrument via four orthogonal properties -- reliability across K runs, cross-instrument replication across architecturally distinct judges, historical-footprint calibration via judges from earlier training cohorts, and pre-registered prediction. We test it on emotional accompaniment by letting the rubric evolve in a data-driven way across iterations: the dimensions are not stipulated in advance, and the procedure stabilizes at a 9-dimension set. Pre-registration covers 10 falsifiable hypotheses and 11 forward predictions, all committed before any test data was collected. Applied to 49 models across 8 families, the paradigm surfaces what aggregate scores hide. On advice-restraint -- whether a model refrains from giving unsolicited solutions in empathic contexts -- gpt-5 falls 1.87 points from gpt-4.1 and Opus-4.7 falls 0.629 from Opus-4.6, while aggregate scores stay flat. The regression survives three user-proxy swaps (retaining 95% of its magnitude), replicates across a 5-family judge stack and a 17-month cohort gap, and persists on 74 held-out real ESConv conversations (rho in [0.749, 0.850]); the instrument reaches ordinal Krippendorff alpha = 0.91. As a by-product, the paradigm acts as a saturation-source diagnostic, separating instrumental ceilings (breakable by rubric refinement) from structural ceilings (needing scenario or roster intervention).
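
The abstract reports reliability as an ordinal Krippendorff alpha but does not say how it was computed. As a rough illustration only, the sketch below implements the standard textbook formula (coincidence matrix plus ordinal distance metric) in plain Python. The function name, the 1-to-5 scale, and the example scores are hypothetical and are not taken from the paper; here each "unit" is assumed to be one model response scored by several judge runs.

```python
from itertools import permutations


def ordinal_krippendorff_alpha(units, levels):
    """Krippendorff's alpha with the ordinal distance metric.

    units  -- one list of ratings per rated item (None marks a missing rating)
    levels -- every possible rating value, ordered from lowest to highest
    """
    index = {value: i for i, value in enumerate(levels)}
    q = len(levels)

    # Coincidence matrix: o[c][k] accumulates ordered rating pairs (c, k)
    # found within the same unit, each weighted by 1 / (m - 1), where m is
    # the number of ratings that unit received.
    o = [[0.0] * q for _ in range(q)]
    for ratings in units:
        ranks = [index[r] for r in ratings if r is not None]
        m = len(ranks)
        if m < 2:
            continue  # a unit with a single rating has no pairs to compare
        for a, b in permutations(ranks, 2):
            o[a][b] += 1.0 / (m - 1)

    n_c = [sum(row) for row in o]  # marginal total per rating level
    n = sum(n_c)                   # total number of pairable ratings
    if n < 2:
        return float("nan")

    def delta2(c, k):
        # Ordinal metric: squared count of ratings lying between levels c
        # and k, with the two end levels counted at half weight.
        lo, hi = min(c, k), max(c, k)
        return (sum(n_c[lo:hi + 1]) - (n_c[lo] + n_c[hi]) / 2.0) ** 2

    observed = sum(o[c][k] * delta2(c, k) for c in range(q) for k in range(q))
    expected = sum(
        n_c[c] * n_c[k] * delta2(c, k) for c in range(q) for k in range(q)
    ) / (n - 1)
    if expected == 0:
        return float("nan")  # every rating identical: alpha is undefined
    return 1.0 - observed / expected


if __name__ == "__main__":
    # Hypothetical data: 4 responses, each scored on a 1-5 scale in K = 3 runs.
    scores = [[4, 4, 5], [2, 2, 2], [5, 5, 4], [1, 2, 1]]
    print(ordinal_krippendorff_alpha(scores, levels=[1, 2, 3, 4, 5]))
```

Because the ordinal metric penalizes a 1-versus-5 disagreement far more than a 4-versus-5 one, alpha can stay high when judges differ only by adjacent scale points. A value of 1 means perfect agreement, and values at or below 0 mean agreement no better than chance.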
