HauntAttack: When Attack Follows Reasoning as a Shadow
This work addresses a critical safety problem for developers and users of reasoning-based AI models and highlights the urgent challenge of balancing reasoning capability with safety. Its contribution is incremental, however, because it builds on existing adversarial attack methods.
The paper examines the vulnerability of Large Reasoning Models (LRMs) to jailbreaks by introducing HauntAttack, a black-box adversarial attack framework that embeds harmful instructions into reasoning questions. Across 11 LRMs it reports an average attack success rate of 70%, with up to 12 percentage points of absolute improvement over the strongest prior baseline.
Emerging Large Reasoning Models (LRMs) consistently excel in mathematical and reasoning tasks, showcasing remarkable capabilities. However, the enhancement of reasoning abilities and the exposure of internal reasoning processes introduce new safety vulnerabilities. A critical question arises: when reasoning becomes intertwined with harmfulness, will LRMs become more vulnerable to jailbreaks in reasoning mode? To investigate this, we introduce HauntAttack, a novel and general-purpose black-box adversarial attack framework that systematically embeds harmful instructions into reasoning questions. Specifically, we modify key reasoning conditions in existing questions with harmful instructions, thereby constructing a reasoning pathway that guides the model step by step toward unsafe outputs. We evaluate HauntAttack on 11 LRMs and observe an average attack success rate of 70%, achieving up to 12 percentage points of absolute improvement over the strongest prior baseline. Our further analysis reveals that even advanced safety-aligned models remain highly susceptible to reasoning-based attacks, offering insights into the urgent challenge of balancing reasoning capability and safety in future model development.
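The two headline numbers, an average attack success rate (ASR) and an improvement stated in percentage points, are easy to misread, so a minimal sketch of the arithmetic may help. It assumes that ASR is the share of attempts a judge marks as successful and that the average is an unweighted mean over models; the abstract specifies neither. The function names, model names, and judgment values below are hypothetical placeholders, not the paper's data or evaluation code.

```python
from statistics import mean


def attack_success_rate(judgments):
    """Return the percentage of attempts judged successful.

    `judgments` is a list of booleans, one per attempt (assumed format).
    """
    if not judgments:
        raise ValueError("need at least one judgment")
    return 100.0 * sum(judgments) / len(judgments)


def average_asr(per_model_judgments):
    """Unweighted mean of per-model ASR values (an assumption, see lead-in)."""
    return mean(attack_success_rate(j) for j in per_model_judgments.values())


def absolute_improvement(asr, baseline_asr):
    """Difference in percentage points, not a relative percentage change."""
    return asr - baseline_asr


# Toy placeholder judgments, invented purely to exercise the functions.
toy_judgments = {
    "model_a": [True, True, False, True],
    "model_b": [True, False, False, True],
}

toy_average = average_asr(toy_judgments)
toy_gain = absolute_improvement(toy_average, baseline_asr=50.0)
print(f"average ASR: {toy_average:.1f}%, gain: {toy_gain:.1f} points")
```

The distinction in `absolute_improvement` matters when reading the claim: a gain of 12 percentage points is a difference between two ASR values, not a 12% relative increase over the baseline.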