CV · Sep 24, 2025

JaiLIP: Jailbreaking Vision-Language Models via Loss Guided Image Perturbation

arXiv:2509.21401v2 · 1 citation · h-index: 11
Originality: Incremental advance
AI Analysis

This work addresses safety-alignment concerns for vision-language models by demonstrating practical image-based jailbreak attacks that could enable misuse, though it is an incremental advance over existing perturbation techniques.

The authors tackle the problem of jailbreaking vision-language models via image perturbations. They propose JaiLIP, which minimizes a joint objective combining an MSE loss and a harmful-output loss, and report highly effective, imperceptible adversarial images that outperform existing methods on toxicity metrics.

Vision-Language Models (VLMs) have remarkable abilities in multimodal generation and reasoning tasks. However, concerns about potential misuse and the safety alignment of VLMs have increased significantly as different categories of attack vectors have emerged. Among these attack vectors, recent studies have demonstrated that image-based perturbations are particularly effective at eliciting harmful outputs. Many techniques have been proposed in the literature to jailbreak VLMs, but they often suffer from unstable performance and visible perturbations. In this study, we propose Jailbreaking with Loss-guided Image Perturbation (JaiLIP), a jailbreaking attack in the image space that minimizes a joint objective combining the mean squared error (MSE) loss between the clean and adversarial images with the model's harmful-output loss. We evaluate our proposed method on VLMs using standard toxicity metrics from Perspective API and Detoxify. Experimental results demonstrate that our method generates highly effective and imperceptible adversarial images, outperforming existing methods in the toxicity of the outputs it elicits. Moreover, we evaluate our method in the transportation domain to demonstrate the attack's practicality beyond toxic text generation, in a specific application domain. Our findings emphasize the practical challenges posed by image-based jailbreak attacks and the need for efficient defense mechanisms for VLMs.
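
The joint objective described in the abstract can be written out as a rough sketch. The notation is an assumption for illustration and is not taken from the paper: x is the clean image with d pixel values, x_adv is the adversarial image, q is the text prompt, Y is a set of target harmful outputs, p_θ is the VLM's output distribution, and λ is a hypothetical weight balancing the two terms. Writing the harmful-output loss as a negative log-likelihood is also an assumption; the abstract does not specify its form, nor how the two terms are weighted or normalized.

```latex
% Sketch only: symbols, the weight \lambda, and the NLL form of L_harm are assumptions
\min_{x_{\mathrm{adv}}} \; \mathcal{L}(x_{\mathrm{adv}})
  = \underbrace{\frac{1}{d}\,\lVert x_{\mathrm{adv}} - x \rVert_2^2}_{\text{MSE to the clean image}}
  \;+\; \lambda \, \mathcal{L}_{\mathrm{harm}}(x_{\mathrm{adv}}),
\qquad
\mathcal{L}_{\mathrm{harm}}(x_{\mathrm{adv}})
  = -\frac{1}{|Y|} \sum_{y \in Y} \log p_\theta\!\left(y \mid x_{\mathrm{adv}}, q\right)
```

Under this reading, the MSE term pulls the adversarial image toward the clean one, which is where imperceptibility enters the optimization, while the second term steers the model's output toward the targets; λ sets the trade-off between the two.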

