CV · LG · May 27

Structure-Guided Visual Perturbation Neutralization for LVLMs

arXiv:2605.27927 · 71.5 · h-index: 4 · Has Code
Predicted impact: top 41% in CV · last 90 days
Originality: Incremental advance
AI Analysis

It addresses the vulnerability of LVLMs to pixel-level adversarial attacks with a computationally efficient defense that maintains cross-modal alignment.

The paper proposes SIGN, a lightweight defense against adversarial perturbations for LVLMs that achieves over 87% defense success rate with only 0.5% pixel modification and 0.16 seconds per image, while preserving benign task performance.

Image inputs enable Large Vision Language Models (LVLMs) to perceive fine-grained visual information, but also introduce a pixel-level attack surface through which adversarial perturbations can elicit unsafe model behaviors. However, most existing defenses are designed for traditional computer vision settings and thus often overlook the cross-modal alignment required by LVLMs, leading to degraded performance. Meanwhile, the limited defenses tailored to LVLMs often require substantial image modifications and introduce considerable computational overhead, thereby compromising inference quality and efficiency. To address these limitations, we propose Structure-Induced Guided Neutralization (SIGN), a lightweight, plug-and-play defense framework that improves LVLM compatibility via Prior Structural Extraction and achieves efficient perturbation suppression via Dynamic Guided Neutralization. Extensive experiments show that SIGN achieves over 87% defense success rate with only 0.5% pixel modification and 0.16 seconds per image, while nearly preserving original visual representations and benign task performance. Our work offers a lightweight alternative to defenses that require costly model training and highlights the potential of exploiting a vision encoder for efficient adversarial protection. Our code is open source at https://anonymous.4open.science/r/SIGN-BCB1.

