CR · LG · Feb 13, 2025

Jailbreak Attack Initializations as Extractors of Compliance Directions

arXiv:2502.09755v3 · 7 citations · h-index: 14 · Has Code · EMNLP
Originality: Incremental advance
AI Analysis

This work examines the fragility of safety-aligned LLMs and is aimed at security researchers. It is an incremental advance because it builds on existing gradient-based jailbreak methods.

The paper tackles the problem of understanding why certain initialization methods improve jailbreak attacks on safety-aligned LLMs, revealing that attacks converge to a compliance direction that suppresses refusal. It proposes CRI, an initialization framework that boosts attack success rates and reduces computational overhead across multiple models and datasets.

Safety-aligned LLMs respond to prompts with either compliance or refusal, each corresponding to distinct directions in the model's activation space. Recent works show that initializing attacks via self-transfer from other prompts significantly enhances their performance. However, the underlying mechanisms of these initializations remain unclear, and attacks still rely on arbitrary or hand-picked initializations. This work shows that each gradient-based jailbreak attack, and each subsequent initialization, gradually converges to a single compliance direction that suppresses refusal, thereby enabling an efficient transition from refusal to compliance. Based on this insight, we propose CRI, an initialization framework that aims to project unseen prompts further along compliance directions. We demonstrate our approach on multiple attacks, models, and datasets, achieving an increased attack success rate (ASR) and reduced computational overhead, which highlights the fragility of safety-aligned LLMs. A reference implementation is available at: https://amit1221levi.github.io/CRI-Jailbreak-Init-LLMs-evaluation.
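To make the idea of a "direction in activation space" concrete, the sketch below estimates a direction as the normalized difference between the mean activation of compliant responses and the mean activation of refusals, then measures how far a single activation lies along it. This is a generic difference-of-means illustration, not the paper's CRI procedure. The function names (`compliance_direction`, `projection`), the 8-dimensional activations, and the synthetic Gaussian data are all assumptions made for the example; a real analysis would read activations from a chosen layer of the model.

```python
import math
import random


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def compliance_direction(compliant_acts, refusal_acts):
    """Unit vector pointing from the refusal mean to the compliance mean.

    Hypothetical helper: a difference-of-means estimate, assumed here
    for illustration only.
    """
    mu_c = mean_vector(compliant_acts)
    mu_r = mean_vector(refusal_acts)
    diff = [c - r for c, r in zip(mu_c, mu_r)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def projection(activation, direction):
    """Scalar projection of one activation onto the unit direction."""
    return sum(a * d for a, d in zip(activation, direction))


# Synthetic stand-in data (assumption): two Gaussian clusters in 8 dimensions,
# with the "compliant" cluster shifted along the first coordinate.
rng = random.Random(0)
dim = 8
refusal_acts = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(200)]
compliant_acts = [
    [rng.gauss(0.0, 1.0) + (2.0 if i == 0 else 0.0) for i in range(dim)]
    for _ in range(200)
]

direction = compliance_direction(compliant_acts, refusal_acts)

# Average position of each cluster along the estimated direction.
mean_proj_compliant = sum(projection(a, direction) for a in compliant_acts) / len(compliant_acts)
mean_proj_refusal = sum(projection(a, direction) for a in refusal_acts) / len(refusal_acts)

print(f"compliant: {mean_proj_compliant:.3f}  refusal: {mean_proj_refusal:.3f}")
```

Under this toy setup, a larger projection means an activation sits further toward the compliant cluster. That is the sense in which an initialization could be said to lie "further along" a compliance direction.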

