Self-interpreting Adversarial Images
This work addresses a security vulnerability in AI systems that affects both users and developers: adversarial images can steer models into producing harmful content such as misinformation. The contribution is incremental, however, because it builds on existing prompt injection concepts.
The paper tackles indirect, cross-modal injection attacks on visual language models. It introduces self-interpreting images with hidden meta-instructions that steer model outputs toward an adversary-chosen style or viewpoint, while the answers remain plausible and grounded in the image's visual content.
We introduce a new type of indirect, cross-modal injection attack against visual language models that enables the creation of self-interpreting images. These images contain hidden "meta-instructions" that control how models answer users' questions about the image and steer models' outputs to express an adversary-chosen style, sentiment, or point of view. Self-interpreting images act as soft prompts, conditioning the model to satisfy the adversary's (meta-)objective while still producing answers based on the image's visual content. Meta-instructions are thus a stronger form of prompt injection. Adversarial images look natural and the model's answers are coherent and plausible, yet they also follow the adversary-chosen interpretation, e.g., political spin, or even objectives that are not achievable with explicit text instructions. We evaluate the efficacy of self-interpreting images for a variety of models, interpretations, and user prompts. We describe how these attacks could cause harm by enabling the creation of self-interpreting content that carries spam, misinformation, or spin. Finally, we discuss defenses.
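The abstract does not say how a self-interpreting image is computed. As a rough illustration of the "image as soft prompt" idea only, the sketch below optimizes a small, bounded perturbation so that a toy model prefers an adversary-chosen output across several different questions. Everything here is an assumption made for illustration: the linear stand-in model (`toy_logits`), the projected sign-gradient update, and the names `eps`, `step`, `W_img`, and `W_q` are hypothetical and are not taken from the paper. A real visual language model would replace the toy model, and the target would be full answers rather than a single token.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def toy_logits(image, questions, W_img, W_q):
    # Hypothetical stand-in for a visual language model:
    # image (d,), questions (n, q) -> logits (n, k).
    return image @ W_img + questions @ W_q

def mean_cross_entropy(image, questions, targets, W_img, W_q):
    # Average loss of the adversary-chosen target token over all questions.
    probs = softmax(toy_logits(image, questions, W_img, W_q))
    picked = probs[np.arange(len(targets)), targets]
    return float(-np.log(picked + 1e-12).mean())

def craft_perturbation(image, questions, targets, W_img, W_q,
                       eps=0.1, step=0.01, iters=200):
    # Projected sign-gradient descent on the image (assumed procedure).
    delta = np.zeros_like(image)
    rows = np.arange(len(targets))
    for _ in range(iters):
        probs = softmax(toy_logits(image + delta, questions, W_img, W_q))
        probs[rows, targets] -= 1.0             # d(loss)/d(logits), per question
        grad = (probs @ W_img.T).mean(axis=0)   # d(loss)/d(image), shape (d,)
        delta = delta - step * np.sign(grad)    # descend on the target loss
        delta = np.clip(delta, -eps, eps)       # keep the change small
        delta = np.clip(image + delta, 0.0, 1.0) - image  # valid pixel range
    return delta

# Toy usage with random data (no real model or image involved).
rng = np.random.default_rng(0)
d, q, k, n = 64, 16, 10, 8
W_img = rng.normal(size=(d, k))
W_q = rng.normal(size=(q, k))
image = rng.uniform(0.0, 1.0, size=d)
questions = rng.normal(size=(n, q))
targets = np.full(n, 3)  # same adversary-chosen token id for every question

delta = craft_perturbation(image, questions, targets, W_img, W_q)
print("loss before:", mean_cross_entropy(image, questions, targets, W_img, W_q))
print("loss after: ", mean_cross_entropy(image + delta, questions, targets, W_img, W_q))
```

The bound `eps` is what keeps the perturbed image close to the original, which loosely mirrors the claim that adversarial images look natural. Averaging the gradient over many questions is what makes a single image steer answers to different user prompts, rather than to one fixed prompt.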