CV · AI · Apr 19

Cross-Modal Attention Analysis and Optimization in Vision-Language Models: A Study on Visual Reliability

arXiv:2604.17217 · 3.7 · h-index: 2
AI Analysis

For practitioners deploying VLMs, this work provides a method to diagnose and mitigate over-reliance on text, improving visual grounding.

The paper proposes an adversarial evaluation framework to quantify text shortcut learning in VLMs, and shows that an optimized LoRA fine-tuning approach reduces accuracy drop under conflicting text from 27.5% to 9.8% (64.4% relative improvement) while maintaining 97% normal accuracy.
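The relative-improvement figure follows directly from the two Drop values quoted above. A minimal sketch of that arithmetic, where the function name `relative_improvement` is illustrative and not taken from the paper:

```python
def relative_improvement(drop_before: float, drop_after: float) -> float:
    """Fraction of the baseline accuracy drop that the new model removes."""
    if drop_before <= 0:
        raise ValueError("drop_before must be positive")
    return (drop_before - drop_after) / drop_before


# Average Drop values as reported in the summary, in percentage points.
baseline_drop = 27.5
optimized_drop = 9.8

rel = relative_improvement(baseline_drop, optimized_drop)
print(f"{rel:.1%}")  # prints 64.4%
```

Note that the 17.7-point reduction is an absolute change in Drop, while 64.4% is that reduction expressed relative to the baseline Drop.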

Vision-Language Models (VLMs) achieve strong cross-modal performance, yet recent evidence suggests they over-rely on textual descriptions while under-utilizing visual evidence, a phenomenon termed "text shortcut learning." We propose an adversarial evaluation framework that quantifies this cross-modal dependency by measuring accuracy degradation (Drop) when semantically conflicting text is paired with unchanged images. Four adversarial strategies (shape_swap, color_swap, position_swap, and random_text) are applied to a controlled geometric-shapes dataset (n = 1,000). We compare three configurations: Baseline CLIP (ViT-B/32), LoRA fine-tuning, and LoRA Optimized (integrating Hard Negative Mining, Label Smoothing, layer-wise learning rates, Cosine Restarts, curriculum learning, and data augmentation). The optimized model reduces average Drop from 27.5% to 9.8% (64.4% relative improvement, p < 0.001) while maintaining 97% normal accuracy. Attention visualization and embedding-space analysis confirm that the optimized model attends more to visual features and achieves tighter cross-modal alignment.
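The evaluation protocol described in the abstract can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' code: the attribute vocabularies, the caption template, the choice of shape as the prediction target, and the two stand-in predictors are all hypothetical, and each sample's attribute dictionary stands in for an actual image.

```python
import random

# Assumed vocabularies; the paper's actual attribute values are not given here.
SHAPES = ["circle", "square", "triangle"]
COLORS = ["red", "green", "blue"]
POSITIONS = ["left", "center", "right"]
UNRELATED = ["a dog running on a beach", "a bowl of soup"]
STRATEGIES = ["shape_swap", "color_swap", "position_swap", "random_text"]


def caption(shape, color, position):
    # Hypothetical caption template.
    return f"a {color} {shape} on the {position}"


def other(value, options, rng):
    """Pick any option except the true one, so the text conflicts with the image."""
    return rng.choice([o for o in options if o != value])


def adversarial_caption(sample, strategy, rng):
    shape, color, position = sample["shape"], sample["color"], sample["position"]
    if strategy == "shape_swap":
        shape = other(shape, SHAPES, rng)
    elif strategy == "color_swap":
        color = other(color, COLORS, rng)
    elif strategy == "position_swap":
        position = other(position, POSITIONS, rng)
    elif strategy == "random_text":
        return rng.choice(UNRELATED)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return caption(shape, color, position)


def accuracy(predict, samples, texts):
    hits = sum(predict(s, t) == s["shape"] for s, t in zip(samples, texts))
    return hits / len(samples)


def drop(predict, samples, strategy, rng):
    """Accuracy with matching text minus accuracy with conflicting text."""
    clean = [caption(s["shape"], s["color"], s["position"]) for s in samples]
    adv = [adversarial_caption(s, strategy, rng) for s in samples]
    return accuracy(predict, samples, clean) - accuracy(predict, samples, adv)


def image_only(sample, text):
    # Idealized stand-in for a model that relies purely on the image.
    return sample["shape"]


def text_only(sample, text):
    # Stand-in for a pure text shortcut: read the shape word from the caption.
    return next((w for w in text.split() if w in SHAPES), None)


rng = random.Random(0)
samples = [
    {"shape": rng.choice(SHAPES), "color": rng.choice(COLORS),
     "position": rng.choice(POSITIONS)}
    for _ in range(100)
]
drops_image = {s: drop(image_only, samples, s, rng) for s in STRATEGIES}
drops_text = {s: drop(text_only, samples, s, rng) for s in STRATEGIES}
print(drops_image)
print(drops_text)
```

In this toy setup the image-only predictor shows zero Drop under every strategy, while the text-only predictor loses all of its accuracy whenever the shape word is swapped or replaced. A real VLM would be expected to fall somewhere between these two extremes, which is the quantity the average Drop summarizes.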

