AI | LG | Feb 26

CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety

arXiv:2602.22557v1 | 1 citation | h-index: 4
Originality: Highly original
AI Analysis

This framework provides a robust, interpretable, and adaptable solution for LLM safety, particularly for organizations needing to enforce new governance rules without expensive retraining.

This paper introduces CourtGuard, a retrieval-augmented multi-agent framework that reframes LLM safety evaluation as an "Evidentiary Debate" to overcome the adaptation rigidity of static classifiers. It achieves state-of-the-art performance across 7 safety benchmarks without fine-tuning and demonstrates 90% accuracy on an out-of-domain Wikipedia Vandalism task by simply changing the reference policy.

Current safety mechanisms for Large Language Models (LLMs) rely heavily on static, fine-tuned classifiers that suffer from adaptation rigidity: the inability to enforce new governance rules without expensive retraining. To address this, we introduce CourtGuard, a retrieval-augmented multi-agent framework that reimagines safety evaluation as Evidentiary Debate. By orchestrating an adversarial debate grounded in external policy documents, CourtGuard achieves state-of-the-art performance across 7 safety benchmarks, outperforming dedicated policy-following baselines without fine-tuning. Beyond standard metrics, we highlight two critical capabilities: (1) Zero-Shot Adaptability, where our framework successfully generalized to an out-of-domain Wikipedia Vandalism task (achieving 90% accuracy) by swapping the reference policy; and (2) Automated Data Curation and Auditing, where we leveraged CourtGuard to curate and audit nine novel datasets of sophisticated adversarial attacks. Our results demonstrate that decoupling safety logic from model weights offers a robust, interpretable, and adaptable path for meeting current and future regulatory requirements in AI governance.
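The policy-swapping idea in the abstract can be illustrated with a toy sketch. This is not CourtGuard's implementation: the abstract does not describe its agents, prompts, or retriever, so every name below (`Policy`, `retrieve`, `prosecutor`, `defender`, `judge`, `evaluate`) is hypothetical, and simple keyword-overlap stubs stand in for the retrieval step and the LLM agents. The only point it tries to show is that the verdict depends on the reference policy passed in, not on anything baked into the evaluator.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical sketch: none of these names come from the paper.

@dataclass(frozen=True)
class Policy:
    """A reference policy: a name plus plain-text clauses."""
    name: str
    clauses: List[str]

@dataclass(frozen=True)
class Verdict:
    label: str             # "violation" or "compliant"
    cited: List[str]       # policy clauses the debate was grounded in
    transcript: List[str]  # the two arguments, kept for auditing

def keywords(text: str) -> set:
    """Lowercased words of length >= 5 (a crude stand-in for real retrieval features)."""
    words = (w.strip(".,!?").lower() for w in text.split())
    return {w for w in words if len(w) >= 5}

def retrieve(policy: Policy, text: str, k: int = 2) -> List[str]:
    """Toy retriever: up to k clauses that share at least one keyword with the text."""
    query = keywords(text)
    scored = [(len(query & keywords(c)), c) for c in policy.clauses]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [clause for score, clause in ranked if score > 0][:k]

# Stub agents. In a real system each would be an LLM call; that is an assumption here.
def prosecutor(text: str, evidence: List[str]) -> str:
    return "; ".join(f"violates: {c}" for c in evidence)

def defender(text: str, evidence: List[str]) -> str:
    return "context may be benign" if evidence else "no applicable clause"

def judge(attack: str, defense: str) -> str:
    # Stub rule: side with the prosecution only if it cited at least one clause.
    return "violation" if attack else "compliant"

def evaluate(
    text: str,
    policy: Policy,
    prosecute: Callable[[str, List[str]], str] = prosecutor,
    defend: Callable[[str, List[str]], str] = defender,
    decide: Callable[[str, str], str] = judge,
) -> Verdict:
    """Run one debate round grounded in clauses retrieved from the given policy."""
    evidence = retrieve(policy, text)
    attack = prosecute(text, evidence)
    defense = defend(text, evidence)
    return Verdict(decide(attack, defense), evidence, [attack, defense])

WEAPONS = Policy("weapons", [
    "Do not provide instructions for building weapons.",
    "Do not reveal personal addresses.",
])
VANDALISM = Policy("vandalism", [
    "Edits must not insert insults or profanity.",
    "Edits must not blank sourced paragraphs.",
])

edit = "This edit adds insults about the author."
print(evaluate(edit, WEAPONS).label)
print(evaluate(edit, VANDALISM).label)
```

The same `evaluate` function is called both times; only the `Policy` argument changes. The `Verdict` keeps the cited clauses and both arguments, which is one plausible way to make a decision auditable.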

