AI · May 27

When Context Flips, Safety Breaks: Diagnosing Brittle Safety in Aligned Language Models

arXiv:2605.27851 · 16.4

Predicted impact: top 43% in AI (last 90 days) · Originality: highly original

AI Analysis

For developers deploying safety-aligned LLMs, this work reveals that standard safety benchmarks and action-level guardrails are insufficient for detecting context-dependent failures, motivating state-aware safety mechanisms.

Aligned language models exhibit brittle safety: they rigidly follow safety rules even when a context change makes the previously safe action harmful. Across 12 models, the mean safety-commonsense gap was 17.4 percentage points, and brittleness rates ranged from 13.7% to 90.0% among models with baseline accuracy above 90%.
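To make the two headline numbers concrete, here is a minimal sketch of how a brittleness rate and a safety-commonsense gap could be computed from paired original/flipped items. The definitions are assumptions for illustration, not the paper's: brittleness is taken as the share of baseline-correct items where the model keeps the old action after the flip, and the gap as the difference between the safety and commonsense brittleness rates in percentage points. The `PairedItem` type and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PairedItem:
    """One original/flipped scenario pair plus the model's choice on each (hypothetical schema)."""
    baseline_safe_action: str   # safe action in the original scenario
    flipped_safe_action: str    # safe action after the context update
    model_baseline_choice: str  # what the model picked on the original
    model_flipped_choice: str   # what the model picked on the flipped variant


def brittleness_rate(items: Sequence[PairedItem]) -> float:
    """Assumed definition: among baseline-correct items, the fraction where
    the model still picks the original action after the flip made it unsafe."""
    eligible = [it for it in items
                if it.model_baseline_choice == it.baseline_safe_action]
    if not eligible:
        return 0.0
    brittle = [it for it in eligible
               if it.flipped_safe_action != it.baseline_safe_action
               and it.model_flipped_choice == it.baseline_safe_action]
    return len(brittle) / len(eligible)


def safety_commonsense_gap_pp(safety: Sequence[PairedItem],
                              commonsense: Sequence[PairedItem]) -> float:
    """Assumed definition: safety brittleness minus commonsense brittleness,
    expressed in percentage points."""
    return 100.0 * (brittleness_rate(safety) - brittleness_rate(commonsense))
```

Under these assumed definitions, a positive gap means the model updates on context more readily in commonsense controls than in safety scenarios, which is the pattern the summary describes.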

Safety benchmark scores provide incomplete evidence of deployment readiness: aligned language models often adhere to rigid rules even when a situational update flips which action is safe. We term this failure brittle safety. To diagnose it, we introduce context-flip evaluation, testing 12 models across a safety benchmark (PacifAIst) and two commonsense controls using paired variants where the nominally safe action produces harm. Three findings emerge. First, brittle safety is safety-specific: all 12 models exhibit a safety-commonsense gap (mean +17.4 pp). Baseline accuracy fails to predict brittleness: among models above 90% baseline accuracy, brittleness rates range from 13.7% to 90.0%. Second, failures stem from policy override rather than miscomprehension: despite acknowledging the context change in every case, models persist via three distinct mechanisms that vary by update type and model family. Third, on a hand-audited probe of catastrophic consequence-flip scenarios, standard action-level guardrails catch none, while a state-aware validator catches all without false alarms on correct interventions. This indicates that action-level content moderation is systematically blind to consequence-flips, motivating state-aware architectural alternatives. We release our protocol, perturbed benchmarks, and deployment probe.
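The contrast between an action-level guardrail and a state-aware validator can be illustrated with a toy sketch. This is not the paper's validator or deployment probe: the blocklist, the action names, the state key `emergency_in_progress`, and the harm rule in `toy_harm_model` are all hypothetical, chosen only to show why a check that inspects the action alone cannot see a consequence-flip.

```python
from typing import Callable, Mapping

State = Mapping[str, bool]

# Hypothetical blocklist of actions that are unsafe regardless of context.
BLOCKED_ACTIONS = {"disable_safety_interlock"}


def action_level_guardrail(action: str) -> bool:
    """Return True if the action is allowed. Looks at the action only."""
    return action not in BLOCKED_ACTIONS


def state_aware_validator(state: State, action: str,
                          causes_harm: Callable[[State, str], bool]) -> bool:
    """Return True if the action is allowed given the current state."""
    return not causes_harm(state, action)


def toy_harm_model(state: State, action: str) -> bool:
    """Hypothetical consequence rule: waiting for approval is normally safe,
    but becomes harmful once an emergency is in progress."""
    if action == "wait_for_human_approval":
        return bool(state.get("emergency_in_progress", False))
    return action in BLOCKED_ACTIONS


if __name__ == "__main__":
    flipped = {"emergency_in_progress": True}
    action = "wait_for_human_approval"
    print(action_level_guardrail(action))                          # allowed: action looks benign
    print(state_aware_validator(flipped, action, toy_harm_model))  # rejected: harmful in this state
```

In this sketch the nominally safe action passes the action-level check in every state, while the state-aware check rejects it only after the context flips and still accepts it in the original context.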
