CVMar 17, 2025

Web Artifact Attacks Disrupt Vision Language Models

Maan Qraitem, Piotr Teterwak, Kate Saenko, Bryan A. Plummer

arXiv:2503.13652v23 citationsh-index: 25Has Code

Originality Incremental advance

AI Analysis

This addresses a security and robustness issue for users of VLMs, such as in AI applications, by revealing a broader class of attacks beyond prior typographic methods, though it is incremental in extending existing attack vectors.

The paper tackles the problem of vision-language models (VLMs) being vulnerable to attacks that exploit unintended correlations learned from web data, introducing 'artifact-based' attacks using non-matching text and graphical elements, which achieve up to 100% success rates and transfer across models with up to 90% effectiveness.

Vision-language models (VLMs) (e.g. CLIP, LLaVA) are trained on large-scale, lightly curated web datasets, leading them to learn unintended correlations between semantic concepts and unrelated visual signals. These associations degrade model accuracy by causing predictions to rely on incidental patterns rather than genuine visual understanding. Prior work has weaponized these correlations as an attack vector to manipulate model predictions, such as inserting a deceiving class text onto the image in a "typographic" attack. These attacks succeed due to VLMs' text-heavy bias-a result of captions that echo visible words rather than describing content. However, this attack has focused solely on text that matches the target class exactly, overlooking a broader range of correlations, including non-matching text and graphical symbols, which arise from the abundance of branding content in web-scale data. To address this gap, we introduce "artifact-based" attacks: a novel class of manipulations that mislead models using both non-matching text and graphical elements. Unlike typographic attacks, these artifacts are not predefined, making them simultaneously harder to defend against and more challenging to find. We address this by framing artifact attacks as a search problem and demonstrate their effectiveness across five datasets, with some artifacts reinforcing each other to reach 100% attack success rates. These attacks transfer across models with up to 90% effectiveness, making it possible to attack unseen models. To defend against these attacks, we extend prior work's artifact aware prompting to the graphical setting. We see a moderate reduction of success rates of up to 15% relative to standard prompts, suggesting a promising direction for enhancing model robustness. Code: https://github.com/mqraitem/Web-Artifact-Attacks

View on arXiv PDF Code

Similar