CV · Oct 12, 2025

When Images Speak Louder: Mitigating Language Bias-induced Hallucinations in VLMs through Cross-Modal Guidance

arXiv:2510.10466v1 · 1 citation · h-index: 7
Originality: Incremental advance
AI Analysis

This work addresses hallucinations in VLMs for applications such as image captioning and visual QA, but the advance is incremental, since it builds on existing decoding methods and adds no new training.

The paper tackles language bias-induced hallucinations in Vision-Language Models (VLMs), where a model generates responses that are fluent but irrelevant to the image. It introduces Cross-Modal Guidance (CMG), a training-free decoding method that reduces hallucinations by leveraging the difference between the output distribution of the original model and that of the same model with degraded visual-language attention. The paper reports improved performance on hallucination-specific benchmarks.

Vision-Language Models (VLMs) have shown solid ability for multimodal understanding of both visual and language contexts. However, existing VLMs often suffer from severe hallucinations: they tend to generate responses that are fluent in language but irrelevant to the images in the preceding context. To address this issue, we analyze how language bias contributes to hallucinations and then introduce Cross-Modal Guidance (CMG), a training-free decoding method that mitigates hallucinations by leveraging the difference between the output distributions of the original model and of the same model with degraded visual-language attention. In practice, as a concrete type of degradation, we adaptively mask the attention weights of the most influential image tokens in selected transformer layers to corrupt the visual-language perception. Such degradation-induced decoding emphasizes the perception of visual contexts and therefore significantly reduces language bias without harming the general ability of VLMs. In the experiments section, we conduct comprehensive studies. All results demonstrate the advantages of CMG, which requires neither additional conditions nor training costs. We also show quantitatively that CMG improves the performance of different VLMs on hallucination-specific benchmarks and generalizes effectively.
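
The decoding idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula or masking rule, so everything here is an assumption: the function names, the fixed top-k masking (the paper masks adaptively), the guidance weight `alpha`, and the contrastive form `(1 + alpha) * original - alpha * degraded`, which is the form commonly used by contrastive decoding methods and not necessarily the one used by CMG.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mask_top_image_tokens(attn_row, image_positions, k):
    """Zero the k largest attention weights among image tokens, then renormalize.

    attn_row: attention weights of one query over all keys (sums to 1).
    image_positions: indices of the keys that are image tokens.
    k: number of image tokens to mask (fixed here; an assumption, since the
       paper describes the masking as adaptive).
    """
    top = sorted(image_positions, key=lambda i: attn_row[i], reverse=True)[:k]
    masked = [0.0 if i in top else w for i, w in enumerate(attn_row)]
    total = sum(masked)
    return [w / total for w in masked] if total > 0 else masked


def guided_distribution(logits_original, logits_degraded, alpha=1.0):
    """Combine the two next-token logit vectors (assumed contrastive form).

    Tokens that the degraded pass (weakened visual attention) still favors are
    pushed down; tokens that depend on the image are pushed up.
    """
    guided = [
        (1 + alpha) * o - alpha * d
        for o, d in zip(logits_original, logits_degraded)
    ]
    return softmax(guided)
```

For instance, with original logits `[2.0, 1.0]`, degraded logits `[1.0, 1.5]`, and `alpha=1.0`, the guided logits are `[3.0, 0.5]`, so the first token, which loses support when visual attention is degraded, receives a higher probability than under the original distribution. In a real VLM the degraded logits would come from a second forward pass with the masked attention applied in the selected layers.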

