CR, AI | Oct 9, 2025

VisualDAN: Exposing Vulnerabilities in VLMs with Visual-Driven DAN Commands

arXiv:2510.09699v1 | 1 citation | h-index: 2
Originality: Incremental advance
AI Analysis

VisualDAN exposes a critical security flaw in VLMs, which are widely used for multimodal tasks, and highlights the need for robust defenses against image-based attacks.

The paper tackles the vulnerability of Vision-Language Models (VLMs) to jailbreak attacks by introducing VisualDAN, a single adversarial image that bypasses safeguards in models like MiniGPT-4 and LLaVA, forcing them to produce harmful outputs that violate ethical standards.

Vision-Language Models (VLMs) have garnered significant attention for their remarkable ability to interpret and generate multimodal content. However, securing these models against jailbreak attacks continues to be a substantial challenge. Unlike text-only models, VLMs integrate additional modalities, introducing novel vulnerabilities such as image hijacking, which can manipulate the model into producing inappropriate or harmful responses. Drawing inspiration from text-based jailbreaks like the "Do Anything Now" (DAN) command, this work introduces VisualDAN, a single adversarial image embedded with DAN-style commands. Specifically, we prepend harmful corpora with affirmative prefixes (e.g., "Sure, I can provide the guidance you need") to trick the model into responding positively to malicious queries. The adversarial image is then trained on these DAN-inspired harmful texts and transformed into the text domain to elicit malicious outputs. Extensive experiments on models such as MiniGPT-4, MiniGPT-v2, InstructBLIP, and LLaVA reveal that VisualDAN effectively bypasses the safeguards of aligned VLMs, forcing them to execute a broad range of harmful instructions that severely violate ethical standards. Our results further demonstrate that even a small amount of toxic content can significantly amplify harmful outputs once the model's defenses are compromised. These findings highlight the urgent need for robust defenses against image-based attacks and offer critical insights for future research into the alignment and security of VLMs.

