Jailbreak Attack Initializations as Extractors of Compliance Directions
This work highlights the fragility of safety-aligned LLMs and is relevant to security researchers, though its contribution is incremental because it builds on existing gradient-based jailbreak methods.
The paper tackles the problem of understanding why certain initialization methods improve jailbreak attacks on safety-aligned LLMs, revealing that attacks converge to a compliance direction that suppresses refusal. It proposes CRI, an initialization framework that boosts attack success rates and reduces computational overhead across multiple models and datasets.
Safety-aligned LLMs respond to prompts with either compliance or refusal, each corresponding to distinct directions in the model's activation space. Recent works show that initializing attacks via self-transfer from other prompts significantly enhances their performance. However, the underlying mechanisms of these initializations remain unclear, and existing attacks rely on arbitrary or hand-picked initializations. We show that gradient-based jailbreak attacks and their subsequent initializations gradually converge to a single compliance direction that suppresses refusal, thereby enabling an efficient transition from refusal to compliance. Based on this insight, we propose CRI, an initialization framework that aims to project unseen prompts further along compliance directions. We demonstrate our approach on multiple attacks, models, and datasets, achieving an increased attack success rate (ASR) and reduced computational overhead, which highlights the fragility of safety-aligned LLMs. A reference implementation is available at: https://amit1221levi.github.io/CRI-Jailbreak-Init-LLMs-evaluation.
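To make the notion of a "direction in activation space" concrete, the following is a minimal sketch on synthetic vectors only. It assumes a difference-of-means estimate of the direction, a common choice in interpretability work; the abstract does not state how the compliance direction is computed, so this should not be read as the CRI procedure. All names (`difference_of_means_direction`, `projection`, `sample_activation`) and the toy data are hypothetical.

```python
import random

def mean_vector(rows):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def difference_of_means_direction(compliant, refused):
    """Unit vector pointing from the mean 'refused' activation
    toward the mean 'compliant' activation (an assumed estimator)."""
    mu_c = mean_vector(compliant)
    mu_r = mean_vector(refused)
    diff = [c - r for c, r in zip(mu_c, mu_r)]
    norm = sum(x * x for x in diff) ** 0.5
    return [x / norm for x in diff]

def projection(vec, direction):
    """Scalar projection of vec onto a unit direction."""
    return sum(v * d for v, d in zip(vec, direction))

# Synthetic stand-ins for activations: the two groups differ only
# along coordinate 0, plus small Gaussian noise in every coordinate.
rng = random.Random(0)
DIM = 8

def sample_activation(offset):
    return [rng.gauss(0.0, 0.1) + (offset if i == 0 else 0.0)
            for i in range(DIM)]

compliant_acts = [sample_activation(+1.0) for _ in range(50)]
refused_acts = [sample_activation(-1.0) for _ in range(50)]

direction = difference_of_means_direction(compliant_acts, refused_acts)

mean_compliant_proj = sum(projection(v, direction) for v in compliant_acts) / 50
mean_refused_proj = sum(projection(v, direction) for v in refused_acts) / 50
```

In this toy setting the recovered direction aligns with coordinate 0, and the two groups separate cleanly by their projection onto it. That separation is the sense in which a prompt can be said to lie "further along" a direction.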