CR · SE · May 15

Compositional Jailbreaking: An Empirical Analysis of Mutator Chain Interactions in Aligned LLMs

arXiv:2605.15598 · 69.4
Predicted impact: top 17% in CR · last 90 days
Originality: Incremental advance
AI Analysis

For AI safety researchers, the paper reveals nuanced dynamics of adversarial prompt composition and structural properties of safety alignment that are not apparent from single-strategy evaluations.

This paper systematically studies mutator chaining in jailbreaking LLMs, finding that most combinations fail to outperform individual mutators, but a small fraction produce synergistic effects that improve attack success rates.

Jailbreaking attacks on large language models pose a significant threat to AI safety by enabling the generation of harmful or restricted content. While prior work has explored both handcrafted and automated jailbreak strategies, the potential for compositional interaction between simple attacks remains underexplored. This paper presents a systematic study of mutator chaining, in which weak jailbreak transformations are applied sequentially, to characterize how they interact: whether they reinforce one another, interfere destructively, or produce no meaningful change. We implement twelve baseline mutators and evaluate all ordered pairs on a benchmark of harmful prompts against three popular LLMs. Our framework introduces metrics for completeness and validity that capture both transformation persistence and attack effectiveness. Results reveal that the interaction landscape is highly non-uniform: most combinations fail to outperform individual mutators, exhibiting destructive interference or structural incompatibility, while a small fraction produce synergistic effects that improve attack success rates. Equally important, the prevalent failure modes reveal structural properties of safety alignment that are not apparent from single-strategy evaluations. These findings highlight the nuanced dynamics of adversarial prompt composition and offer new insights for building more robust safety defenses.
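
The pairwise design described in the abstract can be sketched as a small bookkeeping routine. This is a minimal illustration, not the authors' framework: the mutator names `m0` to `m11`, the `classify` helper, its tolerance `tol`, and the success rates in the usage lines are all hypothetical, and the sketch assumes that "all ordered pairs" excludes chaining a mutator with itself (the abstract does not say). It contains no mutator logic, only the enumeration of chains and a comparison of a chain's attack success rate (ASR) against the better of its two components.

```python
from itertools import permutations


def ordered_pairs(mutators):
    """Return all ordered pairs (a, b) with a != b.

    Order matters because a is applied before b.
    Assumption: self-pairs (a, a) are excluded.
    """
    return list(permutations(mutators, 2))


def classify(asr_first, asr_second, asr_chain, tol=0.02):
    """Label a chain relative to its better single mutator.

    The three labels and the tolerance band are illustrative
    assumptions, not definitions taken from the paper.
    """
    best_single = max(asr_first, asr_second)
    if asr_chain > best_single + tol:
        return "synergistic"
    if asr_chain < best_single - tol:
        return "destructive"
    return "neutral"


# Placeholder names standing in for the twelve baseline mutators.
mutators = [f"m{i}" for i in range(12)]
pairs = ordered_pairs(mutators)
print(len(pairs))  # 132 ordered pairs (12 * 11)

# Hypothetical ASR values, for illustration only.
print(classify(0.10, 0.15, 0.30))  # synergistic
print(classify(0.10, 0.15, 0.05))  # destructive
```

Under this assumption each model is evaluated on 132 chains; if self-pairs were included the count would be 144.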

