LG · May 25

SafetyRepro: Configuration-Conditional Rank Instability on Alignment Benchmarks

arXiv:2605.25492 · 8.6

Predicted impact: top 48% in LG · last 90 days

Originality: Incremental advance

AI Analysis

For practitioners relying on benchmark comparisons, the paper reports an instability in pairwise model ordering that undermines the reliability of alignment evaluations.

The paper shows that configuration choices in alignment benchmarks can flip pairwise safety rankings between models, and provides a theoretical framework to detect such reversals. On every tested benchmark, configuration alone changed the verdict.

Pairwise model comparisons drawn from foundation-model benchmarks ("A is safer than B") are read as quantitative verdicts but hinge on harness choices benchmark papers under-specify. We close one theory-benchmark loop on this primitive: a finite-envelope proposition tying a measurable pairwise-disagreement rate to whether the strict ordering admits a configuration-pair reversal, paired with a commit-stamped evaluation protocol that operationalises it on widely cited alignment benchmarks. On every benchmark we test, configuration choice alone can flip the pairwise verdict; the proposition isolates this strict-reversal failure mode.
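The abstract does not give the formal definition of the pairwise-disagreement rate, so the following is only a minimal sketch of one plausible reading: treat each harness configuration as producing a strict verdict between models A and B, and count the configuration pairs whose verdicts are strictly opposite. The function names, the configuration labels, and the scores are all hypothetical illustrations, not values or code from the paper.

```python
from itertools import combinations


def strict_verdict(score_a, score_b):
    """Return +1 if A scores strictly higher, -1 if B does, 0 on a tie."""
    return (score_a > score_b) - (score_a < score_b)


def pairwise_disagreement(scores_a, scores_b):
    """Estimate how often two configurations give opposite strict verdicts.

    scores_a, scores_b: dicts mapping a configuration label to a safety
    score (assumed here: higher means safer). Only configurations present
    for both models are compared.
    """
    configs = sorted(set(scores_a) & set(scores_b))
    verdicts = {c: strict_verdict(scores_a[c], scores_b[c]) for c in configs}
    pairs = list(combinations(configs, 2))
    # A configuration pair is a strict reversal when one configuration
    # says "A is safer" and the other says "B is safer"; ties do not count.
    reversals = [(c1, c2) for c1, c2 in pairs
                 if verdicts[c1] * verdicts[c2] == -1]
    rate = len(reversals) / len(pairs) if pairs else 0.0
    return rate, reversals


# Hypothetical scores for two models under three harness configurations.
scores_a = {"greedy": 0.82, "temp0.7": 0.79, "fewshot": 0.75}
scores_b = {"greedy": 0.80, "temp0.7": 0.81, "fewshot": 0.75}

rate, reversals = pairwise_disagreement(scores_a, scores_b)
print(rate, reversals)
```

In this toy input the "greedy" and "temp0.7" configurations give opposite strict verdicts, so the ordering admits a configuration-pair reversal, while the tied "fewshot" configuration contributes none. The paper's finite-envelope proposition may define the rate and the reversal condition differently.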
