CV | CL | LG | Feb 18, 2025

Understanding and Rectifying Safety Perception Distortion in VLMs

arXiv:2502.13095v1 | 10 citations | h-index: 41
Originality: Incremental advance
AI Analysis

This work addresses a safety issue for users of VLMs, but the advance is incremental because it builds on known vulnerabilities in multimodal models.

The paper tackles the problem of vision-language models (VLMs) becoming more vulnerable to harmful requests after integrating vision, and identifies a modality-induced activation shift that causes safety perception distortion. The proposed method, ShiftDC, significantly enhances safety alignment on benchmarks without impairing utility.

Recent studies reveal that vision-language models (VLMs) become more susceptible to harmful requests and jailbreak attacks after integrating the vision modality, exhibiting greater vulnerability than their text-only LLM backbones. To uncover the root cause of this phenomenon, we conduct an in-depth analysis and identify a key issue: multimodal inputs introduce a modality-induced activation shift toward a "safer" direction compared to their text-only counterparts, leading VLMs to systematically overestimate the safety of harmful inputs. We refer to this issue as safety perception distortion. To mitigate such distortion, we propose Activation Shift Disentanglement and Calibration (ShiftDC), a training-free method that decomposes and calibrates the modality-induced activation shift to reduce the impact of modality on safety. By isolating and removing the safety-relevant component, ShiftDC restores the inherent safety alignment of the LLM backbone while preserving the vision-language capabilities of VLMs. Empirical results demonstrate that ShiftDC significantly enhances alignment performance on safety benchmarks without impairing model utility.
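
The decomposition described in the abstract can be illustrated with a small NumPy sketch. This is one plausible reading and not the authors' implementation: the abstract gives no formulas, so the difference-of-means safety direction, the single-vector activations, and the function names `safety_direction` and `calibrate` are all assumptions made for illustration.

```python
import numpy as np


def safety_direction(safe_acts: np.ndarray, unsafe_acts: np.ndarray) -> np.ndarray:
    """Hypothetical safety direction: unit vector pointing from the mean
    activation of unsafe text-only inputs to the mean of safe ones.
    (Assumption: the abstract does not say how the direction is obtained.)"""
    d = safe_acts.mean(axis=0) - unsafe_acts.mean(axis=0)
    return d / np.linalg.norm(d)


def calibrate(multimodal_act: np.ndarray,
              text_only_act: np.ndarray,
              direction: np.ndarray) -> np.ndarray:
    """Remove the safety-relevant part of the modality-induced shift.

    shift            = multimodal activation - text-only activation
    safety component = projection of the shift onto the safety direction
    The remainder of the shift (orthogonal to the direction) is kept, so
    non-safety information carried by the image is left untouched.
    """
    shift = multimodal_act - text_only_act
    safety_component = np.dot(shift, direction) * direction
    return multimodal_act - safety_component


# Toy 2-D example: axis 0 plays the role of the "safety" axis.
safe = np.array([[1.0, 0.0], [3.0, 0.0]])
unsafe = np.array([[-1.0, 0.0], [-3.0, 0.0]])
direction = safety_direction(safe, unsafe)

text_only = np.array([0.0, 1.0])
multimodal = np.array([2.0, 3.0])   # shifted toward "safe" and along axis 1
calibrated = calibrate(multimodal, text_only, direction)
```

In this toy setup the calibrated activation has no remaining shift along the safety direction, while the shift along the other axis is preserved, which mirrors the stated goal of restoring safety perception without removing visual information.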

