Attention-space Contrastive Guidance for Efficient Hallucination Mitigation in LVLMs

arXiv:2601.13707v1 | 2.81 citations | h-index: 2

Originality: Incremental advance

AI Analysis

The paper addresses hallucinations in LVLMs for users who need reliable vision-language outputs. Its contribution is an efficient mitigation method; the advance is incremental and lies mainly in reduced computational cost.

The paper tackles hallucinations in large vision-language models by proposing Attention-space Contrastive Guidance (ACG), a single-pass method that reduces over-reliance on language priors. It reports state-of-the-art faithfulness and caption quality on the CHAIR and POPE benchmarks while cutting latency by up to 2x compared to prior contrastive decoding methods.

Hallucinations in large vision-language models (LVLMs) often arise when language priors dominate over visual evidence, causing object misidentification and visually inconsistent descriptions. We address this issue by framing hallucination mitigation as contrastive guidance, steering generation toward visually grounded and semantically faithful text. This approach regulates the model's internal behavior by reducing over-dependence on language priors and contrasting visually grounded with language-only representations. We propose Attention-space Contrastive Guidance (ACG), a single-pass mechanism that operates within self-attention layers to construct both vision-language and language-only attention paths in a single forward computation. This integration enables computationally efficient guidance directly embedded in the model's representation contextualization. To correct approximation bias introduced by the single-pass formulation, we further apply an orthogonalized correction that removes components aligned with the language-only path, selectively amplifying visual contributions. Experiments on the CHAIR and POPE benchmarks show that ACG achieves state-of-the-art faithfulness and caption quality while significantly reducing computational cost. Our method establishes a principled and efficient alternative, reducing latency by up to 2x compared to prior contrastive decoding methods that require multiple forward passes.
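The single-pass idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name guided_attention, the guidance weight gamma, the boolean mask is_visual, and the exact form of the projection are all assumptions made for illustration. The sketch computes the attention scores once, derives a vision-language path and a language-only path from those same scores (the latter by masking visual keys), and then amplifies only the part of their difference that is orthogonal to the language-only output.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; entries set to -inf receive zero weight.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def guided_attention(q, k, v, is_visual, gamma=1.0, eps=1e-8):
    """Illustrative single-head attention with contrastive guidance.

    q: (n_q, d) queries; k, v: (n_k, d) keys and values.
    is_visual: (n_k,) boolean mask marking visual tokens (hypothetical input).
    gamma: guidance strength (hypothetical parameter).
    Assumes at least one key is a text token, so the language-only
    softmax is well defined.
    """
    # Scores are computed once and shared by both attention paths.
    scores = q @ k.T / np.sqrt(q.shape[-1])

    # Vision-language path: attend over all keys.
    o_vl = softmax(scores) @ v

    # Language-only path: visual keys are masked out of the same scores.
    lang_scores = np.where(is_visual[None, :], -np.inf, scores)
    o_lang = softmax(lang_scores) @ v

    # Contrastive direction between the two paths.
    diff = o_vl - o_lang

    # Orthogonalized correction: drop the component of diff that is
    # aligned with the language-only output, row by row.
    coef = np.sum(diff * o_lang, axis=-1, keepdims=True) / (
        np.sum(o_lang * o_lang, axis=-1, keepdims=True) + eps
    )
    diff_perp = diff - coef * o_lang

    # Amplify only the visually driven, language-orthogonal component.
    return o_vl + gamma * diff_perp, o_lang, diff_perp
```

With gamma set to 0 the function reduces to ordinary attention, and when no key is marked as visual the two paths coincide, so the correction vanishes. Both properties make the sketch easy to sanity-check, but the choice of masking and projection here is only one plausible reading of the abstract.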
