SafetyRepro: Configuration-Conditional Rank Instability on Alignment Benchmarks
For practitioners who rely on benchmark comparisons, the takeaway is that pairwise model orderings are unstable under configuration changes, which undermines the reliability of alignment evaluations.
The paper shows that configuration choices in alignment benchmarks can flip pairwise safety rankings between models, and it provides a finite-envelope proposition that ties a measurable pairwise-disagreement rate to whether such a reversal exists. On every benchmark tested, configuration choice alone could flip the verdict.
Pairwise model comparisons drawn from foundation-model benchmarks ("A is safer than B") are read as quantitative verdicts but hinge on harness choices benchmark papers under-specify. We close one theory-benchmark loop on this primitive: a finite-envelope proposition tying a measurable pairwise-disagreement rate to whether the strict ordering admits a configuration-pair reversal, paired with a commit-stamped evaluation protocol that operationalises it on widely cited alignment benchmarks. On every benchmark we test, configuration choice alone can flip the pairwise verdict; the proposition isolates this strict-reversal failure mode.
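To make the strict-reversal failure mode concrete, the following is a minimal Python sketch of one way to check for a configuration-pair reversal and to compute a disagreement rate. It is an illustration under stated assumptions, not the paper's protocol: the function names, the configuration labels, the scores, and the definition of the rate (the fraction of configuration pairs whose strict orderings of A and B are opposite) are all hypothetical.

```python
from itertools import combinations


def strict_sign(a, b):
    """Return +1 if a > b, -1 if a < b, and 0 on a tie."""
    return (a > b) - (a < b)


def pairwise_disagreement(scores, model_a, model_b):
    """Find configuration pairs that strictly reverse the A-vs-B ordering.

    `scores` maps a configuration name to {model name: safety score}.
    Returns (rate, reversals), where `reversals` lists the configuration
    pairs with opposite strict orderings and `rate` is their share of all
    configuration pairs. This rate is an assumed definition for illustration,
    not necessarily the quantity used in the paper.
    """
    signs = {
        cfg: strict_sign(per_model[model_a], per_model[model_b])
        for cfg, per_model in scores.items()
    }
    pairs = list(combinations(signs, 2))
    # A strict reversal needs one config with A > B and another with B > A;
    # ties (sign 0) never count as a reversal.
    reversals = [(c1, c2) for c1, c2 in pairs if signs[c1] * signs[c2] == -1]
    rate = len(reversals) / len(pairs) if pairs else 0.0
    return rate, reversals


# Made-up scores for two models under three hypothetical harness configurations.
scores = {
    "greedy_zero_shot": {"A": 0.91, "B": 0.88},
    "sampled_zero_shot": {"A": 0.86, "B": 0.89},
    "greedy_few_shot": {"A": 0.90, "B": 0.90},
}

rate, reversals = pairwise_disagreement(scores, "A", "B")
print(rate, reversals)
```

In this toy input, A beats B under one configuration and loses under another, so the verdict "A is safer than B" depends on the harness alone. The tied configuration contributes no reversal, which is why the sketch tracks strict orderings rather than score differences.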