A Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial Robustness
This work exposes critical reliability issues in widely used automated methods for evaluating AI safety, which can make benchmarks and safety assessments in natural language processing misleading.
The study finds that LLM-as-a-Judge frameworks for safety evaluation often degrade to near random chance under the distribution shifts that adversarial attacks introduce, as shown by an audit using 6642 human-verified labels. Many attacks also inflate their success rates by exploiting judge insufficiencies rather than eliciting genuinely harmful content.
Automated "LLM-as-a-Judge" frameworks have become the de facto standard for scalable evaluation across natural language processing. For instance, in safety evaluation, these judges are relied upon to evaluate harmfulness in order to benchmark the robustness of safety against adversarial attacks. However, we show that existing validation protocols fail to account for substantial distribution shifts inherent to red-teaming: diverse victim models exhibit distinct generation styles, attacks distort output patterns, and semantic ambiguity varies significantly across jailbreak scenarios. Through a comprehensive audit using 6642 human-verified labels, we reveal that the unpredictable interaction of these shifts often causes judge performance to degrade to near random chance. This stands in stark contrast to the high human agreement reported in prior work. Crucially, we find that many attacks inflate their success rates by exploiting judge insufficiencies rather than eliciting genuinely harmful content. To enable more reliable evaluation, we propose ReliableBench, a benchmark of behaviors that remain more consistently judgeable, and JudgeStressTest, a dataset designed to expose judge failures. Data available at: https://github.com/SchwinnL/LLMJudgeReliability.
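To make the notion of "near random chance" concrete, the following is a minimal sketch of how judge verdicts could be compared against human labels separately for each (victim model, attack) slice. It is an illustration only and not the paper's evaluation code: the metric (balanced accuracy, where 0.5 corresponds to chance), the record layout, the function names, and the toy records are all assumptions made for this example.

```python
from collections import defaultdict


def balanced_accuracy(human, judge):
    """Mean of true-positive and true-negative rate; 0.5 is chance level."""
    tp = sum(1 for h, j in zip(human, judge) if h and j)
    tn = sum(1 for h, j in zip(human, judge) if not h and not j)
    pos = sum(1 for h in human if h)
    neg = len(human) - pos
    if pos == 0 or neg == 0:
        return None  # undefined when a slice contains only one class
    return 0.5 * (tp / pos + tn / neg)


def per_slice_balanced_accuracy(records):
    """Group records by (victim_model, attack) and score each slice.

    records: iterable of (victim_model, attack, human_label, judge_label),
    where both labels are booleans meaning "response is harmful".
    This layout is hypothetical, chosen for the sketch.
    """
    slices = defaultdict(lambda: ([], []))
    for victim, attack, human, judge in records:
        humans, judges = slices[(victim, attack)]
        humans.append(human)
        judges.append(judge)
    return {key: balanced_accuracy(h, j) for key, (h, j) in slices.items()}


# Made-up toy records, not taken from the paper's data.
records = [
    ("model_a", "attack_x", True, True),
    ("model_a", "attack_x", False, False),
    ("model_a", "attack_x", True, True),
    ("model_a", "attack_x", False, False),
    ("model_b", "attack_y", True, False),
    ("model_b", "attack_y", True, True),
    ("model_b", "attack_y", False, True),
    ("model_b", "attack_y", False, False),
]

scores = per_slice_balanced_accuracy(records)
for key, score in scores.items():
    print(key, score)
```

In this toy data the judge agrees with the human labels on one slice and is at chance on the other, even though the pooled accuracy across both slices is 0.75. Reporting agreement per slice rather than pooled is one way such a degradation could become visible.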