CV, Jan 13

Semantic Misalignment in Vision-Language Models under Perceptual Degradation

arXiv:2601.08355v1
Originality: Incremental advance
AI Analysis

This work addresses a limitation that matters for safety-critical applications such as autonomous driving, where unreliable perception can cause semantic misalignment. Its contribution is incremental: it mainly highlights an evaluation gap.

The study investigates the robustness of Vision-Language Models (VLMs) under realistic perception degradation and finds that even moderate drops in segmentation metrics lead to severe failures in downstream VLM behavior, such as hallucinated objects and omission of safety-critical entities.

Vision-Language Models (VLMs) are increasingly deployed in autonomous driving and embodied AI systems, where reliable perception is critical for safe semantic reasoning and decision-making. While recent VLMs demonstrate strong performance on multimodal benchmarks, their robustness to realistic perception degradation remains poorly understood. In this work, we systematically study semantic misalignment in VLMs under controlled degradation of upstream visual perception, using semantic segmentation on the Cityscapes dataset as a representative perception module. We introduce perception-realistic corruptions that induce only moderate drops in conventional segmentation metrics, yet observe severe failures in downstream VLM behavior, including hallucinated object mentions, omission of safety-critical entities, and inconsistent safety judgments. To quantify these effects, we propose a set of language-level misalignment metrics that capture hallucination, critical omission, and safety misinterpretation, and analyze their relationship with segmentation quality across multiple contrastive and generative VLMs. Our results reveal a clear disconnect between pixel-level robustness and multimodal semantic reliability, highlighting a critical limitation of current VLM-based systems and motivating the need for evaluation frameworks that explicitly account for perception uncertainty in safety-critical applications.
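The abstract names three language-level misalignment metrics (hallucination, critical omission, safety misinterpretation) but does not define them here. The sketch below shows one plausible way such metrics could be computed from set comparisons; it is an assumption for illustration, not the authors' formulation. The `Sample` type, the function names, and the `SAFETY_CRITICAL` class list are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical choice of safety-critical Cityscapes classes (an assumption;
# the paper's actual list is not stated in this summary).
SAFETY_CRITICAL = frozenset({"person", "rider", "bicycle", "motorcycle"})


@dataclass(frozen=True)
class Sample:
    present: frozenset    # classes actually in the scene (ground truth)
    mentioned: frozenset  # classes the VLM output mentions
    judged_safe: bool     # VLM's safety judgment for the scene
    truly_safe: bool      # reference safety label


def hallucination_rate(s: Sample) -> float:
    """Fraction of mentioned classes that are not present in the scene."""
    if not s.mentioned:
        return 0.0
    return len(s.mentioned - s.present) / len(s.mentioned)


def critical_omission_rate(s: Sample) -> float:
    """Fraction of present safety-critical classes the VLM failed to mention."""
    critical = s.present & SAFETY_CRITICAL
    if not critical:
        return 0.0
    return len(critical - s.mentioned) / len(critical)


def safety_misinterpretation_rate(samples) -> float:
    """Fraction of samples where the VLM's safety judgment disagrees with the label."""
    samples = list(samples)
    if not samples:
        return 0.0
    return sum(s.judged_safe != s.truly_safe for s in samples) / len(samples)


# Toy example: the VLM invents a truck, misses a pedestrian, and calls the scene safe.
example = Sample(
    present=frozenset({"car", "person", "road"}),
    mentioned=frozenset({"car", "truck"}),
    judged_safe=True,
    truly_safe=False,
)
```

In the toy example, one of the two mentioned classes is absent from the scene and the only safety-critical class present goes unmentioned. A scene like this can score poorly on all three measures even when the upstream segmentation metrics drop only moderately, which is the disconnect the abstract describes.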

