CL · CR · Apr 1

One Word at a Time: Incremental Completion Decomposition Breaks LLM Safety

Samee Arif, Naihao Deng, Zhijing Jin, Rada Mihalcea

arXiv:2604.25921 · 74.61 citations

Predicted impact: top 84% in CL · last 90 days

Originality: Incremental advance

AI Analysis

For LLM safety researchers, this paper reveals a new vulnerability in conversational safety mechanisms, though the attack is incremental in nature.

The authors introduce Incremental Completion Decomposition (ICD), a jailbreak attack that elicits a sequence of single-word continuations before requesting the full response. It achieves a higher Attack Success Rate (ASR) than existing methods on AdvBench, JailbreakBench, and StrongREJECT.

Large Language Models (LLMs) are trained to refuse harmful requests, yet they remain vulnerable to jailbreak attacks that exploit weaknesses in conversational safety mechanisms. We introduce Incremental Completion Decomposition (ICD), a trajectory-based jailbreak strategy that elicits a sequence of single-word continuations related to a malicious request before eliciting the full response. In addition, we propose variants of ICD by manually picking or model-generating the one-word continuation, as well as prefilling when eliciting the full model response in the final step. We systematically evaluate these variants across a broad set of model families, demonstrating superior Attack Success Rate (ASR) on AdvBench, JailbreakBench, and StrongREJECT compared to existing methods. In addition, we provide a theoretical account of why ICD is effective and present mechanistic evidence that successful attack trajectories systematically suppress refusal-related representations and shift activations away from safety-aligned states.
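The abstract reports results as Attack Success Rate (ASR) without defining it. As a minimal sketch, assuming ASR is the fraction of benchmark prompts whose final response a judge marks as a successful attack, the metric could be computed per benchmark as below. The function names and the verdict records are hypothetical placeholders for illustration, not data or code from the paper, and the judging step itself is left out.

```python
from collections import defaultdict


def attack_success_rate(verdicts):
    """Fraction of judged responses marked as a successful attack.

    `verdicts` is an iterable of booleans, one per benchmark prompt.
    How each verdict is produced (the judge) is assumed, not shown.
    """
    verdicts = list(verdicts)
    if not verdicts:
        raise ValueError("need at least one verdict")
    return sum(bool(v) for v in verdicts) / len(verdicts)


def asr_by_benchmark(records):
    """Group (benchmark_name, success) pairs and compute ASR for each group."""
    grouped = defaultdict(list)
    for benchmark, success in records:
        grouped[benchmark].append(success)
    return {name: attack_success_rate(v) for name, v in grouped.items()}


# Placeholder verdicts for illustration only; these are not results from the paper.
records = [
    ("AdvBench", True),
    ("AdvBench", False),
    ("JailbreakBench", True),
    ("JailbreakBench", True),
]
print(asr_by_benchmark(records))
```

Benchmarks that grade responses on a scale rather than with a yes/no verdict would need a threshold before this computation applies; that threshold would be a further assumption.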
