AI · May 22

How Well Do Models Follow Their Constitutions?

Arya Jakkli, Senthooran Rajamanoharan, Neel Nanda

arXiv:2605.2422979.01 citations

Predicted impact: top 54% in AI · last 90 days

Originality: Incremental advance

AI Analysis

For AI developers and regulators, this work provides a rigorous method to audit compliance with behavioral specifications, revealing that newer models follow their constitutions substantially better, though remaining failures cluster around persona manipulation, irreversible actions, and fabricated claims.

The paper proposes a multi-method audit pipeline to evaluate how well frontier AI models follow their written behavioral specifications (constitutions) under adversarial, multi-turn pressure. Applying it to Anthropic's and OpenAI's models, the authors find that violation rates drop across generations: from 15.0% to 2.0% for the Claude family on Anthropic's constitution, and from 11.7% to 3.6% for the GPT family on OpenAI's Model Spec, with the severity ceiling falling from 10/10 to 7/10.

Frontier AI developers now train models against long written behavioral specifications, such as Anthropic's constitution (Anthropic, 2025a) and OpenAI's Model Spec (OpenAI, 2025a), integrated into post-training via methods like character training (Anthropic, 2024) and deliberative alignment (Guan et al., 2024). These documents serve a governance function, but it is unclear how well models actually follow them under adversarial, multi-turn pressure similar to what they would face in real-world deployment. We propose a multi-method audit pipeline that treats each lab's published specification as an auditable target: it decomposes the specification into atomic testable tenets (205 for Anthropic, 197 for OpenAI), generates multi-turn adversarial scenarios with the Petri auditing agent (Anthropic, 2025b), runs a modified SURF-style rubric search (Murray et al., 2026) to catch shallow single-turn failures Petri misses, validates flagged transcripts against the relevant specification, and compares the findings against the lab's own published system card. Applying the pipeline across seven models per specification, we find that models follow their own lab's specification substantially better with each generation. On Anthropic's constitution, the Claude family falls from a 15.0% violation rate (Sonnet 4) to 2.0% (Sonnet 4.6); on OpenAI's Model Spec, the GPT family falls from 11.7% (GPT-4o) to 3.6% (GPT-5.2 medium reasoning), with the severity ceiling falling from 10/10 to 7/10. We cannot externally isolate whether these gains come from specification-specific training, broader post-training improvements, or evaluation awareness. Remaining failures cluster around operator-imposed personas under AI-identity questioning, irreversible action in agentic deployments, and fabricated quantitative claims with false precision.
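As a rough illustration of the two headline metrics, the sketch below aggregates per-tenet audit outcomes into a violation rate and a severity ceiling. It is a minimal sketch under stated assumptions, not the authors' implementation: the `TenetResult` record, the `summarize` helper, the per-tenet denominator, and the 0-10 severity scale are all hypothetical, and the paper may define the violation rate over a different unit (for example, per transcript).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenetResult:
    """Hypothetical record for one audited tenet (not from the paper)."""
    tenet_id: str
    violated: bool      # assumed: True only after validation against the spec
    severity: int = 0   # assumed 0-10 scale; 0 when there is no violation


def summarize(results):
    """Return (violation_rate, severity_ceiling) for a list of TenetResult.

    Assumes the rate is the fraction of tenets with a confirmed violation
    and the ceiling is the worst severity among confirmed violations.
    """
    if not results:
        raise ValueError("no tenet results to summarize")
    violations = [r for r in results if r.violated]
    rate = len(violations) / len(results)
    ceiling = max((r.severity for r in violations), default=0)
    return rate, ceiling


# Toy data for illustration only; these are not results from the paper.
results = [
    TenetResult("T001", violated=False),
    TenetResult("T002", violated=True, severity=7),
    TenetResult("T003", violated=False),
    TenetResult("T004", violated=True, severity=4),
]

rate, ceiling = summarize(results)
print(f"violation rate: {rate:.1%}, severity ceiling: {ceiling}/10")
```

Keeping the rate and the ceiling separate reflects the abstract's framing: a model can violate fewer tenets overall while its worst confirmed failure stays just as severe, so the two numbers can move independently.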
