Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking
This work addresses a critical issue for AI researchers and developers: biases in LLM-judge evaluations can produce misleading claims of alignment progress.
The paper investigates whether LLM-judge preferences in alignment benchmarking correlate with concrete metrics such as safety and factuality. It finds that they do not: LLM-judges prioritize style over substance, and supervised fine-tuning has a greater impact on alignment than preference optimization.
The release of ChatGPT in November 2022 sparked an explosion of interest in post-training and an avalanche of new preference optimization (PO) methods. These methods claim superior alignment by virtue of better correspondence with human pairwise preferences, often measured by LLM-judges. In this work, we attempt to answer the following question -- do LLM-judge preferences translate to progress on other, more concrete metrics for alignment, and if not, why not? We define a concrete metric for alignment, and introduce SOS-Bench (Substance Outweighs Style Benchmark), which is to the best of our knowledge the largest standardized, reproducible LLM meta-benchmark to date. We find that (1) LLM-judge preferences do not correlate with concrete measures of safety, world knowledge, and instruction following; (2) LLM-judges have powerful implicit biases, prioritizing style over factuality and safety; and (3) the supervised fine-tuning (SFT) stage of post-training, and not the PO stage, has the greatest impact on alignment, with data scaling and prompt diversity as the driving factors. Our codebase and complete results can be found at https://github.com/penfever/sos-bench.
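The abstract's central question, whether LLM-judge preferences track concrete metrics, can be framed as a rank-correlation check across models. The following is a minimal sketch of one way such a check might look, not the paper's method: the use of Spearman correlation, the function names, and the toy numbers are all assumptions introduced for illustration and are not taken from SOS-Bench or its codebase.

```python
from statistics import mean


def rank(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(rank(x), rank(y))


if __name__ == "__main__":
    # Toy placeholder values, one entry per hypothetical model.
    # These are NOT results from the paper.
    judge_win_rate = [0.62, 0.48, 0.71, 0.35]
    benchmark_accuracy = [0.55, 0.58, 0.60, 0.52]
    print(spearman(judge_win_rate, benchmark_accuracy))
```

A value near 1 would mean the judge ranks models the same way the concrete metric does, while a value near 0 would mean the judge's ranking carries little information about that metric.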