Stealth Fine-Tuning: Efficiently Breaking Alignment in RVLMs Using Self-Generated CoT
This work addresses security vulnerabilities in aligned AI models, posing risks for deployment in sensitive applications, but it is incremental as it builds on existing attack methods like IDEATOR.
The paper tackles the problem of breaking safety alignment in Reasoning-augmented Vision-Language Models (RVLMs) by introducing Stealth Fine-Tuning, which uses self-generated chain-of-thought traces to bypass defenses, achieving a 38.52% higher attack success rate than IDEATOR with only 499 samples and under 3 hours of training.
Reasoning-augmented Vision-Language Models (RVLMs) rely on safety alignment to prevent harmful behavior, yet their exposed chain-of-thought (CoT) traces introduce new attack surfaces. In this work, we find that the safety alignment of RVLMs can be easily break through a novel attack method termed \textbf{Stealth Fine-Tuning}. Our method elicits harmful reasoning traces through \textbf{segment-level interference} and reuses the self-generated outputs as supervised fine-tuning data. Through a \textbf{turn-based weighted} loss design, yielding a lightweight, distribution-consistent finetuning method. In our experiment, with only 499 samples and under 3 hours on a single A100 (QLoRA), Stealth Fine-Tuning outperforms IDEATOR by 38.52\% ASR while preserving general reasoning ability, as the tuned model retains the original representation distribution. Experiments on AdvBench and several general benchmarks demonstrate that Stealth Fine-Tuning is a low-cost and highly effective way to bypass alignment defenses. \textcolor{red}{\textbf{Disclaimer: This paper contains content that may be disturbing or offensive.}}