CV · AI · LG · Jan 20

Attention-space Contrastive Guidance for Efficient Hallucination Mitigation in LVLMs

arXiv:2601.13707v1 · 1 citation · h-index: 2
Originality: Incremental advance
AI Analysis

The paper addresses hallucinations in LVLMs for users who need reliable vision-language outputs. It offers an efficient solution whose main gain is an incremental improvement in computational cost.

The paper tackles hallucinations in large vision-language models (LVLMs) by proposing Attention-space Contrastive Guidance (ACG), a single-pass method that reduces over-reliance on language priors. It reports state-of-the-art faithfulness and caption quality on the CHAIR and POPE benchmarks while cutting latency by up to 2x compared to prior contrastive decoding methods.

Hallucinations in large vision-language models (LVLMs) often arise when language priors dominate over visual evidence, causing object misidentification and visually inconsistent descriptions. We address this issue by framing hallucination mitigation as contrastive guidance, steering generation toward visually grounded and semantically faithful text. This approach regulates the model's internal behavior by reducing over-dependence on language priors and contrasting visually grounded with language-only representations. We propose Attention-space Contrastive Guidance (ACG), a single-pass mechanism that operates within self-attention layers to construct both vision-language and language-only attention paths in a single forward computation. This integration enables computationally efficient guidance directly embedded in the model's representation contextualization. To correct approximation bias introduced by the single-pass formulation, we further apply an orthogonalized correction that removes components aligned with the language-only path, selectively amplifying visual contributions. Experiments on the CHAIR and POPE benchmarks show that ACG achieves state-of-the-art faithfulness and caption quality while significantly reducing computational cost. Our method establishes a principled and efficient alternative, reducing latency by up to 2x compared to prior contrastive decoding methods that require multiple forward passes.
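
The abstract describes two steps: building a vision-language and a language-only attention path from one forward computation, then removing the component of the guidance signal that is aligned with the language-only path. The following is a minimal sketch of how that idea might look for a single query position. It is not the paper's formulation: the function names, the masking of visual keys to obtain the language-only path, the `gamma` guidance strength, and the exact update rule are all assumptions made for illustration.

```python
import math

# Illustrative sketch only. Names, the key-masking scheme, `gamma`, and the
# update rule are assumptions, not the method as published.

def attention_paths(scores, values, is_visual):
    """Return (vision-language, language-only) outputs for one query.

    scores:    pre-softmax attention scores, one per key (computed once)
    values:    value vectors, one list of floats per key
    is_visual: True where the key belongs to an image token
    """
    def weighted(mask):
        kept = [s for s, keep in zip(scores, mask) if keep]
        if not kept:
            raise ValueError("at least one key must be kept")
        m = max(kept)  # subtract the max for a numerically stable softmax
        w = [math.exp(s - m) if keep else 0.0 for s, keep in zip(scores, mask)]
        z = sum(w)
        dim = len(values[0])
        return [sum(wi * v[j] for wi, v in zip(w, values)) / z for j in range(dim)]

    all_keys = [True] * len(scores)
    text_only = [not v for v in is_visual]
    # Both paths reuse the same scores, so no second forward pass is needed.
    return weighted(all_keys), weighted(text_only)


def orthogonalized_guidance(h_vl, h_l, gamma=1.0, eps=1e-12):
    """Amplify the part of (h_vl - h_l) that is orthogonal to h_l."""
    if len(h_vl) != len(h_l):
        raise ValueError("both paths must have the same dimensionality")
    # Contrastive direction: what the visual tokens add over language alone.
    delta = [a - b for a, b in zip(h_vl, h_l)]
    # Projection coefficient of delta onto the language-only direction.
    norm_sq = sum(b * b for b in h_l)
    coeff = sum(d * b for d, b in zip(delta, h_l)) / (norm_sq + eps)
    # Remove the component aligned with the language-only path.
    delta_perp = [d - coeff * b for d, b in zip(delta, h_l)]
    # Steer the vision-language output along the orthogonal residue.
    return [a + gamma * d for a, d in zip(h_vl, delta_perp)]


# Toy usage: two keys (one visual, one textual) with 2-dimensional values.
h_vl, h_l = attention_paths([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [True, False])
guided = orthogonalized_guidance(h_vl, h_l, gamma=1.0)
```

In this toy setup the guided output moves only along the direction contributed by the visual key, which is the behaviour the abstract attributes to the orthogonalized correction. Setting `gamma=0` leaves the vision-language output unchanged.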
