Jailbreak Transferability Emerges from Shared Representations
This work addresses the security and robustness of AI models, giving developers and users a foundational account of jailbreak vulnerabilities, though its contribution is incremental in that it refines existing knowledge.
The study investigates why jailbreaks transfer between AI models. It finds that transfer emerges from shared representations rather than incidental flaws, that representational similarity and attack strength shape how much transfer occurs, and that increasing similarity through distillation causally increases transfer.
Jailbreak transferability is the surprising phenomenon in which an adversarial attack that compromises one model also elicits harmful responses from other models. Despite widespread demonstrations, there is little consensus on why transfer is possible: is it a quirk of safety training, an artifact of model families, or a more fundamental property of representation learning? We present evidence that transferability emerges from shared representations rather than incidental flaws. Across 20 open-weight models and 33 jailbreak attacks, we find two factors that systematically shape transfer: (1) representational similarity under benign prompts, and (2) the strength of the jailbreak on the source model. To move beyond correlation, we show that deliberately increasing similarity through benign-only distillation causally increases transfer. Our qualitative analyses reveal systematic transferability patterns across different types of jailbreaks. For example, persona-style jailbreaks transfer far more often than cipher-based prompts, consistent with the idea that natural-language attacks exploit models' shared representation space, whereas cipher-based attacks rely on idiosyncratic quirks that do not generalize. Together, these results reframe jailbreak transfer as a consequence of representation alignment rather than a fragile byproduct of safety training.
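To make "representational similarity under benign prompts" concrete, the sketch below scores two models by comparing their hidden activations on the same set of benign prompts. The abstract does not say which similarity measure is used, so linear centered kernel alignment (CKA) is an assumption here, chosen only because it is a common way to compare representations of different widths. The function name `linear_cka` and the activation matrices are hypothetical; extracting activations from real models is out of scope.

```python
import numpy as np


def linear_cka(acts_a: np.ndarray, acts_b: np.ndarray) -> float:
    """Linear CKA between two activation matrices.

    acts_a has shape (n_prompts, d_a) and acts_b has shape (n_prompts, d_b).
    Row i of each matrix must come from the same benign prompt.
    Assumption: this metric is illustrative, not necessarily the paper's.
    """
    if acts_a.shape[0] != acts_b.shape[0]:
        raise ValueError("both matrices need one row per shared prompt")

    # Center each feature so the score ignores constant offsets.
    a = acts_a - acts_a.mean(axis=0, keepdims=True)
    b = acts_b - acts_b.mean(axis=0, keepdims=True)

    # Squared Frobenius norm of the cross-covariance, normalized by the
    # Frobenius norms of each model's own covariance.
    cross = np.linalg.norm(b.T @ a, ord="fro") ** 2
    norm_a = np.linalg.norm(a.T @ a, ord="fro")
    norm_b = np.linalg.norm(b.T @ b, ord="fro")
    return float(cross / (norm_a * norm_b))
```

The score lies between 0 and 1 and is unchanged by rotating or uniformly rescaling either model's activation space, which is why it can compare models with different hidden sizes. Under the abstract's claim, model pairs with higher scores of this kind would be expected to show more jailbreak transfer.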