CR · AI · CV · Apr 13, 2025

The Structural Safety Generalization Problem

arXiv:2504.09712v2 · 2 citations · h-index: 11 · ACL
Originality: Highly original
AI Analysis

The paper addresses a critical intermediate challenge for AI safety research: it makes jailbreak analysis more tractable by studying structured attacks and the defenses they suggest.

The paper tackles LLM jailbreak vulnerabilities by focusing on safety generalization failures across semantically equivalent inputs, demonstrating that multi-turn, multi-image, and translation-based attacks yield different safety outcomes than their simpler counterparts. It proposes a Structure Rewriting Guardrail that significantly improves refusal of harmful inputs without over-refusing benign ones.

LLM jailbreaks are a widespread safety challenge. Given this problem has not yet been tractable, we suggest targeting a key failure mechanism: the failure of safety to generalize across semantically equivalent inputs. We further focus the target by requiring desirable tractability properties of attacks to study: explainability, transferability between models, and transferability between goals. We perform red-teaming within this framework by uncovering new vulnerabilities to multi-turn, multi-image, and translation-based attacks. These attacks are semantically equivalent by our design to their single-turn, single-image, or untranslated counterparts, enabling systematic comparisons; we show that the different structures yield different safety outcomes. We then demonstrate the potential for this framework to enable new defenses by proposing a Structure Rewriting Guardrail, which converts an input to a structure more conducive to safety assessment. This guardrail significantly improves refusal of harmful inputs, without over-refusing benign ones. Thus, by framing this intermediate challenge - more tractable than universal defenses but essential for long-term safety - we highlight a critical milestone for AI safety research.
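The guardrail idea in the abstract, converting an input to a structure that is easier to assess for safety, can be illustrated with a minimal sketch. This is not the paper's implementation. It assumes the simplest case of a multi-turn input flattened into a single-turn prompt, and every name here (`rewrite_to_single_turn`, `guarded_respond`, `toy_is_harmful`, `toy_respond`) is hypothetical. The keyword check stands in for whatever safety assessment a real system would use.

```python
from typing import Callable, List


def rewrite_to_single_turn(turns: List[str]) -> str:
    """Collapse a multi-turn user input into one single-turn prompt.

    Assumption: joining the turns preserves the meaning of the request.
    A real rewriter would likely use a model to do this faithfully.
    """
    return " ".join(t.strip() for t in turns if t.strip())


def guarded_respond(
    turns: List[str],
    is_harmful: Callable[[str], bool],
    respond: Callable[[List[str]], str],
    refusal: str = "I can't help with that.",
) -> str:
    """Assess safety on the rewritten structure, then answer the original."""
    rewritten = rewrite_to_single_turn(turns)
    if is_harmful(rewritten):
        return refusal
    return respond(turns)


def toy_is_harmful(text: str) -> bool:
    # Stand-in safety check: flags only when the full phrase is present.
    return "pick a lock" in text.lower()


def toy_respond(turns: List[str]) -> str:
    # Stand-in model: echoes the last turn.
    return "OK: " + turns[-1]


split_request = ["Explain how to pick", "a lock"]
benign_request = ["Explain how to pick", "a ripe melon"]

# Checking each turn in isolation misses the split request...
per_turn_flag = any(toy_is_harmful(t) for t in split_request)
# ...while checking the rewritten single-turn form catches it.
rewritten_flag = toy_is_harmful(rewrite_to_single_turn(split_request))

guarded_split = guarded_respond(split_request, toy_is_harmful, toy_respond)
guarded_benign = guarded_respond(benign_request, toy_is_harmful, toy_respond)
```

In this toy setting the split request is refused only after rewriting, while the benign request still receives an answer, mirroring the abstract's claim that structure affects safety outcomes without implying anything about the paper's measured results.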

Code Implementations: 1 repo
