AI · CR · Nov 9, 2025

Efficient LLM Safety Evaluation through Multi-Agent Debate

arXiv:2511.06396v1 · 1 citation · h-index: 19
Originality: Incremental advance
AI Analysis

This work addresses the scalability problem in LLM safety evaluation for researchers and practitioners. It is an incremental advance because it builds on existing LLM-as-a-Judge frameworks.

The authors tackle the high cost of safety evaluation for large language models by proposing a multi-agent judging framework built on small language models. On a new benchmark, the framework achieves agreement comparable to GPT-4o judges while reducing inference costs.

Safety evaluation of large language models (LLMs) increasingly relies on LLM-as-a-Judge frameworks, but the high cost of frontier models limits scalability. We propose a cost-efficient multi-agent judging framework that employs Small Language Models (SLMs) through structured debates among critic, defender, and judge agents. To rigorously assess safety judgments, we construct HAJailBench, a large-scale human-annotated jailbreak benchmark comprising 12,000 adversarial interactions across diverse attack methods and target models. The dataset provides fine-grained, expert-labeled ground truth for evaluating both safety robustness and judge reliability. Our SLM-based framework achieves agreement comparable to GPT-4o judges on HAJailBench while substantially reducing inference cost. Ablation results show that three rounds of debate yield the optimal balance between accuracy and efficiency. These findings demonstrate that structured, value-aligned debate enables SLMs to capture semantic nuances of jailbreak attacks and that HAJailBench offers a reliable foundation for scalable LLM safety evaluation.
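The abstract describes a structured debate among critic, defender, and judge agents, with three rounds reported as the best trade-off. The sketch below illustrates one way such a loop could be organized. It is not the authors' implementation: the `Agent` type, the `run_debate` function, the prompt layout, and the stub agents are all hypothetical stand-ins, and a real setup would replace the stubs with calls to small language models.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical type: any function that maps a prompt string to a reply string.
Agent = Callable[[str], str]


@dataclass
class DebateResult:
    verdict: str                                  # normalized judge output
    transcript: List[str] = field(default_factory=list)


def run_debate(interaction: str, critic: Agent, defender: Agent,
               judge: Agent, rounds: int = 3) -> DebateResult:
    """Run `rounds` critic/defender exchanges, then ask the judge for a verdict.

    The default of 3 rounds follows the abstract's ablation finding; the rest
    of the protocol (turn order, shared transcript) is an assumption.
    """
    transcript: List[str] = []
    for r in range(1, rounds + 1):
        # Each agent sees the interaction plus everything said so far.
        context = "\n".join([interaction] + transcript)
        transcript.append(f"[round {r}] critic: {critic(context)}")
        context = "\n".join([interaction] + transcript)
        transcript.append(f"[round {r}] defender: {defender(context)}")
    # The judge reads the full debate and issues the final label.
    verdict = judge("\n".join([interaction] + transcript))
    return DebateResult(verdict=verdict.strip().lower(), transcript=transcript)


# Stub agents so the sketch runs without any model backend (illustrative only).
def stub_critic(context: str) -> str:
    return "The response contains actionable harmful detail."


def stub_defender(context: str) -> str:
    return "The response stays general and includes a refusal."


def stub_judge(context: str) -> str:
    return "UNSAFE"


if __name__ == "__main__":
    result = run_debate("User: <jailbreak prompt>\nAssistant: <response>",
                        stub_critic, stub_defender, stub_judge)
    print(result.verdict, len(result.transcript))
```

Keeping the agents as plain callables means the same loop can be run with different model backends or round counts, which is the kind of comparison the ablation in the abstract implies.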

