VALD: Multi-Stage Vision Attack Detection for Efficient LVLM Defense
The work addresses security risks in LVLMs deployed in applications such as content moderation and autonomous systems, though its contribution is incremental because it builds on existing defense strategies.
The paper tackles the vulnerability of Large Vision-Language Models (LVLMs) to adversarial images by introducing a training-free defense that uses multi-stage detection and agentic data consolidation to recover correct behavior, achieving state-of-the-art accuracy with minimal computational overhead.
Large Vision-Language Models (LVLMs) can be vulnerable to adversarial images that subtly bias their outputs toward plausible yet incorrect responses. We introduce a general, efficient, and training-free defense that combines image transformations with agentic data consolidation to recover correct model behavior. A key component of our approach is a two-stage detection mechanism that quickly filters out the majority of clean inputs. We first assess image consistency under content-preserving transformations at negligible computational cost. For more challenging cases, we examine discrepancies in a text-embedding space. Only when necessary do we invoke a powerful LLM to resolve attack-induced divergences. The central idea is to consolidate multiple responses, leveraging both their similarities and their differences. We show that our method achieves state-of-the-art accuracy while maintaining notable efficiency: most clean images skip costly processing, and even in the presence of numerous adversarial examples, the overhead remains minimal.
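
The cascade described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how each stage is scored, so the function names (`defend`, `image_score`, `respond`, `embed`, `consolidate`), the thresholds `image_tau` and `text_tau`, and the use of cosine similarity are all assumptions made for illustration. The LVLM, the text encoder, and the consolidating LLM are passed in as plain callables.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Any, Callable, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


@dataclass
class Verdict:
    stage: int      # stage at which the input was resolved (1, 2, or 3)
    response: str   # final answer returned to the user


def defend(
    image: Any,
    transforms: Sequence[Callable[[Any], Any]],
    image_score: Callable[[Any, Any], float],   # hypothetical cheap consistency score
    respond: Callable[[Any], str],              # hypothetical LVLM call
    embed: Callable[[str], Sequence[float]],    # hypothetical text encoder
    consolidate: Callable[[List[str]], str],    # hypothetical LLM consolidation call
    image_tau: float = 0.9,                     # assumed threshold, not from the paper
    text_tau: float = 0.8,                      # assumed threshold, not from the paper
) -> Verdict:
    views = [t(image) for t in transforms]

    # Stage 1: cheap image-level consistency under content-preserving transforms.
    # Inputs that stay consistent are treated as clean and answered directly.
    if all(image_score(image, v) >= image_tau for v in views):
        return Verdict(1, respond(image))

    # Stage 2: compare the model's responses in a text-embedding space.
    responses = [respond(image)] + [respond(v) for v in views]
    vectors = [embed(r) for r in responses]
    if all(cosine(vectors[0], v) >= text_tau for v in vectors[1:]):
        return Verdict(2, responses[0])

    # Stage 3: responses diverge, so hand all of them to the (costly) LLM.
    return Verdict(3, consolidate(responses))
```

Under these assumptions, the expensive consolidation call is reached only when both cheaper checks flag the input, which mirrors the abstract's claim that most clean images skip costly processing.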