CR · AI · LG · Jul 12, 2024

Self-interpreting Adversarial Images

arXiv:2407.08970v4 · 5 citations · h-index: 7
AI Analysis

This work addresses security vulnerabilities in AI systems that affect both users and developers: the attack can be used to produce harmful content such as misinformation. The contribution is incremental, however, because it builds on existing prompt injection concepts.

The paper studies indirect, cross-modal injection attacks on visual language models. It introduces self-interpreting images with hidden meta-instructions, which steer model outputs toward adversary-chosen styles or viewpoints while the answers remain plausible and grounded in the visual content.

We introduce a new type of indirect, cross-modal injection attack against visual language models that enables creation of self-interpreting images. These images contain hidden "meta-instructions" that control how models answer users' questions about the image and steer models' outputs to express an adversary-chosen style, sentiment, or point of view. Self-interpreting images act as soft prompts, conditioning the model to satisfy the adversary's (meta-)objective while still producing answers based on the image's visual content. Meta-instructions are thus a stronger form of prompt injection. Adversarial images look natural and the model's answers are coherent and plausible, yet they also follow the adversary-chosen interpretation, e.g., political spin, or even objectives that are not achievable with explicit text instructions. We evaluate the efficacy of self-interpreting images for a variety of models, interpretations, and user prompts. We describe how these attacks could cause harm by enabling creation of self-interpreting content that carries spam, misinformation, or spin. Finally, we discuss defenses.
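The abstract does not say how the adversarial images are optimized. As a rough illustration of the general idea of an image that "looks natural" yet conditions a model toward an adversary-chosen output, the sketch below applies projected gradient descent (PGD) under an L-infinity bound to a toy linear classifier. Everything here is an assumption made for illustration: the linear model stands in for a visual language model, the `target` class stands in for the adversary's meta-objective, and the names `pgd_perturb`, `eps`, and `step` are hypothetical. It is not the paper's method.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(x, weights, target):
    """Loss of the toy linear model on flattened input x for class `target`."""
    return -np.log(softmax(weights @ x)[target])

def pgd_perturb(image, weights, target, eps=8 / 255, step=1 / 255, iters=200):
    """Toy L-inf PGD (assumed setup, not the paper's procedure).

    Finds a perturbation with every pixel change bounded by `eps` that lowers
    the toy model's loss on the adversary-chosen `target` class.
    """
    x0 = image.ravel()
    delta = np.zeros_like(x0)
    onehot = np.zeros(weights.shape[0])
    onehot[target] = 1.0
    for _ in range(iters):
        x = np.clip(x0 + delta, 0.0, 1.0)
        probs = softmax(weights @ x)
        grad = weights.T @ (probs - onehot)          # d(cross-entropy)/dx
        # Signed gradient step, then project back into the eps-ball.
        delta = np.clip(delta - step * np.sign(grad), -eps, eps)
    return np.clip(x0 + delta, 0.0, 1.0).reshape(image.shape)

rng = np.random.default_rng(0)
image = rng.uniform(0.2, 0.8, size=(8, 8))   # stand-in for a natural image
weights = rng.normal(size=(3, 64))           # stand-in for a model
adv = pgd_perturb(image, weights, target=2)
```

The `eps` bound is what keeps the perturbed image visually close to the original in this sketch. A real attack on a visual language model would replace the toy loss with one computed over the model's generated text.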

Code Implementations: 1 repo