CV | Mar 15

Safety-Potential Pruning for Enhancing Safety Prompts Against VLM Jailbreaking Without Retraining

arXiv:2603.14219 | 41.6 | h-index: 1
AI Analysis

This paper addresses jailbreak resistance in VLMs for users who rely on safety prompts, offering an incremental improvement through a structural intervention that requires no retraining.

The paper tackles the problem of enhancing safety prompts against jailbreak attacks in vision-language models by introducing Safety-Potential Pruning, a one-shot pruning framework that reduces attack success rates by up to 22% relative to prompting alone while maintaining benign performance.

Safety prompts constitute an interpretable layer of defense against jailbreak attacks in vision-language models (VLMs); however, their efficacy is constrained by the models' latent structural responsiveness. We observe that such prompts consistently engage a sparse set of parameters that remain largely quiescent during benign use. This finding motivates the Safety Subnetwork Hypothesis: VLMs embed structurally distinct pathways capable of enforcing safety, but these pathways remain dormant without explicit stimulation. To expose and amplify these pathways, we introduce Safety-Potential Pruning, a one-shot pruning framework that amplifies safety-relevant activations by removing weights that are less responsive to safety prompts, without additional retraining. Across three representative VLM architectures and three jailbreak benchmarks, our method reduces attack success rates by up to 22% relative to prompting alone, all while maintaining strong benign performance. These findings frame pruning not only as a model compression technique, but also as a structural intervention that surfaces alignment-relevant subnetworks, offering a new path to robust jailbreak resistance.
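
The core idea of removing weights that respond weakly to safety prompts can be illustrated with a minimal sketch. The abstract does not state the actual scoring criterion, so the score used here (weight magnitude times the norm of the corresponding input feature, measured on safety-prompted inputs) is an assumption in the spirit of activation-aware pruning, and the function names `safety_response_scores` and `prune_least_responsive` are hypothetical. It operates on a single linear layer stored as plain lists.

```python
from typing import List

Matrix = List[List[float]]


def column_norms(activations: Matrix) -> List[float]:
    """L2 norm of each input feature across a batch (rows are samples)."""
    n_features = len(activations[0])
    return [sum(row[j] ** 2 for row in activations) ** 0.5
            for j in range(n_features)]


def safety_response_scores(weights: Matrix, safety_acts: Matrix) -> Matrix:
    """Assumed score: |w_ij| times the norm of input feature j,
    where the features were recorded while a safety prompt was present."""
    norms = column_norms(safety_acts)
    return [[abs(w) * norms[j] for j, w in enumerate(row)] for row in weights]


def prune_least_responsive(weights: Matrix, safety_acts: Matrix,
                           sparsity: float) -> Matrix:
    """One-shot pruning: zero the lowest-scoring fraction of weights in
    each output row. No retraining or weight update follows."""
    scores = safety_response_scores(weights, safety_acts)
    pruned = []
    for w_row, s_row in zip(weights, scores):
        k = int(len(w_row) * sparsity)  # number of weights to drop per row
        drop = set(sorted(range(len(w_row)), key=lambda j: s_row[j])[:k])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(w_row)])
    return pruned


# Toy layer: 2 outputs, 4 inputs. Values are illustrative, not from the paper.
weights = [[0.5, -2.0, 1.0, 0.1],
           [1.5, 0.2, -0.3, 0.8]]
# Input activations recorded on two safety-prompted samples (made up).
safety_acts = [[1.0, 0.0, 2.0, 0.5],
               [3.0, 0.1, 0.0, 0.5]]

pruned = prune_least_responsive(weights, safety_acts, sparsity=0.5)
print(pruned)
```

In this toy example the large weight -2.0 is removed because its input feature is almost inactive under the safety-prompted inputs, while smaller weights on strongly engaged features survive. That is the sense in which such a criterion differs from plain magnitude pruning.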

