CV | Aug 19, 2025

Enhancing Targeted Adversarial Attacks on Large Vision-Language Models via Intermediate Projector

Yiming Cao, Yanjie Li, Kaisheng Liang, Bin Xiao

arXiv:2508.13739v2 | 2 citations | h-index: 5

Originality: Incremental advance

AI Analysis

This work addresses safety concerns for deployed vision-language models by improving targeted adversarial attacks, representing an incremental advance over existing methods.

The paper tackles the problem of targeted adversarial attacks on large vision-language models by proposing a novel black-box attack framework that leverages the intermediate projector (Q-Former) to enhance attack effectiveness and granularity. Results show that their method significantly outperforms baselines in global targeted attacks and achieves superior success rates with better content preservation in fine-grained attacks, with effective transfer to commercial models like Google Gemini and OpenAI GPT.

The growing deployment of Large Vision-Language Models (VLMs) raises safety concerns, as adversaries may exploit model vulnerabilities to induce harmful outputs, with targeted black-box adversarial attacks posing a particularly severe threat. However, existing methods primarily maximize encoder-level global similarity, which lacks the granularity for stealthy and practical fine-grained attacks, where only a specific target should be altered (e.g., modifying a car while preserving its background). Moreover, they largely neglect the projector, a key semantic bridge in VLMs for multimodal alignment. To address these limitations, we propose a novel black-box targeted attack framework that leverages the projector. Specifically, we utilize the widely adopted Querying Transformer (Q-Former), which transforms global image embeddings into fine-grained query outputs, to enhance attack effectiveness and granularity. For standard global targeted attack scenarios, we propose the Intermediate Projector Guided Attack (IPGA), which aligns Q-Former fine-grained query outputs with the target to enhance attack strength and exploits the intermediate pretrained Q-Former that is not fine-tuned for any specific Large Language Model (LLM) to improve attack transferability. For fine-grained attack scenarios, we augment IPGA with the Residual Query Alignment (RQA) module, which preserves unrelated content by constraining non-target query outputs to enhance attack granularity. Extensive experiments demonstrate that IPGA significantly outperforms baselines in global targeted attacks, and IPGA with RQA (IPGA-R) attains superior success rates and unrelated content preservation over baselines in fine-grained attacks. Our method also transfers effectively to commercial VLMs such as Google Gemini and OpenAI GPT.
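The two objectives described in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: the abstract does not give the loss, so the cosine-similarity form, the function names (`cosine`, `ipga_r_loss`), the `weight` parameter, and the convention that the caller supplies which query indices count as target-related (`target_idx`) are all assumptions made for illustration. Query outputs are plain lists of floats standing in for Q-Former outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def ipga_r_loss(adv_queries, clean_queries, target, target_idx, weight=1.0):
    """Hypothetical IPGA-R style objective (lower is better for the attacker).

    adv_queries   -- query outputs for the adversarial image (list of vectors)
    clean_queries -- query outputs for the clean image (same shape)
    target        -- embedding of the attack target (one vector)
    target_idx    -- indices of queries to steer toward the target (assumed given)
    weight        -- assumed trade-off between attack and preservation terms
    """
    target_set = set(target_idx)
    residual_idx = [i for i in range(len(adv_queries)) if i not in target_set]

    # Alignment term: pull the selected query outputs toward the target.
    align = sum(cosine(adv_queries[i], target) for i in target_set) / len(target_set)

    # Residual term: keep the remaining query outputs close to the clean ones.
    # With no residual queries this reduces to a global, IPGA-like objective.
    if residual_idx:
        drift = sum(
            1.0 - cosine(adv_queries[i], clean_queries[i]) for i in residual_idx
        ) / len(residual_idx)
    else:
        drift = 0.0

    return -align + weight * drift
```

In an actual attack, a quantity like this would be minimized over a bounded image perturbation using gradients through a surrogate encoder and Q-Former; that optimization loop is omitted here because the abstract does not specify it.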

