CL | AI | CR | LG | Jan 31, 2025

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

arXiv:2501.18837v1 | 155 citations | h-index: 31
Originality: Highly original
AI Analysis

The paper addresses security vulnerabilities in LLMs for AI safety applications. It represents a strong, specific gain rather than a foundational breakthrough.

The paper tackles universal jailbreaks, which systematically bypass safeguards in large language models, by introducing Constitutional Classifiers: safeguards trained on synthetic data generated from natural language rules. In over 3,000 estimated hours of red teaming, no universal jailbreak was found, and the classifiers remain viable for deployment, with a 0.38% absolute increase in refusals and a 23.7% inference overhead.

Large language models (LLMs) are vulnerable to universal jailbreaks: prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, like manufacturing illegal substances at scale. To defend against these attacks, we introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural language rules (i.e., a constitution) specifying permitted and restricted content. In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a similar level of detail to an unguarded model across most target queries. On automated evaluations, enhanced classifiers demonstrated robust defense against held-out domain-specific jailbreaks. These classifiers also maintain deployment viability, with an absolute 0.38% increase in production-traffic refusals and a 23.7% inference overhead. Our work demonstrates that defending against universal jailbreaks while maintaining practical deployment viability is tractable.
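To make the two deployment figures concrete, the following is a minimal arithmetic sketch, not anything taken from the paper. Only the 0.38% absolute refusal increase and the 23.7% inference overhead come from the abstract; the function name `deployment_cost`, the baseline refusal rate, the query volume, and the per-query compute unit are hypothetical values chosen for illustration.

```python
# Figures quoted in the abstract.
REFUSAL_INCREASE_ABS = 0.0038   # 0.38 percentage points, absolute
INFERENCE_OVERHEAD = 0.237      # 23.7% extra inference compute


def deployment_cost(queries, base_refusal_rate, base_compute_per_query=1.0):
    """Estimate the effect of adding the classifiers to a deployment.

    All three arguments are hypothetical inputs for illustration;
    the abstract does not report a baseline refusal rate or traffic volume.
    """
    return {
        # An absolute increase is added to the baseline rate, not multiplied.
        "guarded_refusal_rate": base_refusal_rate + REFUSAL_INCREASE_ABS,
        "extra_refusals": queries * REFUSAL_INCREASE_ABS,
        # Overhead scales the total inference compute.
        "guarded_compute": queries * base_compute_per_query * (1 + INFERENCE_OVERHEAD),
    }


if __name__ == "__main__":
    # Hypothetical deployment: 100,000 queries with a 1% baseline refusal rate.
    for name, value in deployment_cost(100_000, 0.01).items():
        print(f"{name}: {value:.4f}")
```

Under these assumed inputs, the absolute increase corresponds to roughly 380 additional refusals per 100,000 queries, and total inference compute rises to about 1.237 times the unguarded baseline.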

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
