CV · Feb 10

Attention to details, logits to truth: visual-aware attention and logits enhancement to mitigate hallucinations in LVLMs

arXiv:2602.09521v1 · h-index: 1
Originality: Incremental advance
AI Analysis

This paper addresses hallucinations in LVLMs, a critical issue for reliable multimodal AI applications, though it is an incremental improvement over existing attention-based methods.

The paper tackles hallucinations in Large Vision-Language Models (LVLMs) with a training-free attentional intervention algorithm that strengthens attention to task-relevant visual tokens and injects visual attention into decoding. The authors report a significant reduction in hallucinations while preserving accuracy and coherence.

Existing Large Vision-Language Models (LVLMs) exhibit insufficient visual attention, leading to hallucinations. To alleviate this problem, some previous studies adjust and amplify visual attention. These methods share a limitation: boosting attention for all visual tokens inevitably increases attention to task-irrelevant tokens. To tackle this challenge, we propose a training-free attentional intervention algorithm that enhances the attention of task-relevant tokens, based on the argument that task-relevant tokens generally demonstrate high visual-textual similarities. Specifically, the vision-text cross-attention submatrices, which represent visual-textual correlations, are extracted to construct the reweighting matrices used to reallocate attention. In addition, to enhance the contribution of visual tokens, we inject visual attention values into beam search decoding to identify solutions with higher visual attention. Extensive experiments demonstrate that this method significantly reduces hallucinations across mainstream LVLMs while preserving the accuracy and coherence of generated content.
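
The abstract does not give the exact formulas, so the following is only a minimal NumPy sketch of the two ideas it describes, not the authors' implementation. It assumes a single row-stochastic attention matrix per head, scores each visual token by the mean attention it receives from text tokens, and adds a visual-attention bonus to each beam's log-probability. The function names `reweight_attention` and `rerank_beams` and the parameters `alpha` and `lam` are hypothetical.

```python
import numpy as np


def reweight_attention(attn, visual_idx, text_idx, alpha=1.0):
    """Boost attention to the visual tokens that text tokens attend to most.

    attn: (seq, seq) attention matrix for one head; each row sums to 1.
    visual_idx, text_idx: positions of visual and text tokens.
    alpha: hypothetical boost strength (an assumption, not from the paper).
    """
    attn = np.asarray(attn, dtype=float)
    # Vision-text submatrix: rows are text queries, columns are visual keys.
    sub = attn[np.ix_(text_idx, visual_idx)]
    # Assumed relevance score: mean attention each visual token gets from text.
    relevance = sub.mean(axis=0)
    # Scale to [0, 1] so the most relevant visual token gets the full boost.
    relevance = relevance / (relevance.max() + 1e-12)
    # Per-key weights: 1 for non-visual tokens, 1 + alpha * relevance otherwise.
    weights = np.ones(attn.shape[1])
    weights[visual_idx] = 1.0 + alpha * relevance
    out = attn * weights  # scales key columns by their weights
    # Renormalize so each row is a probability distribution again.
    return out / out.sum(axis=1, keepdims=True)


def rerank_beams(log_probs, visual_attn_mass, lam=0.5):
    """Order beams by log-probability plus a visual-attention bonus.

    log_probs: cumulative log-probability of each beam.
    visual_attn_mass: attention mass each beam places on visual tokens.
    lam: hypothetical trade-off weight (an assumption, not from the paper).
    """
    scores = np.asarray(log_probs, dtype=float) + lam * np.asarray(
        visual_attn_mass, dtype=float
    )
    return np.argsort(-scores)  # beam indices, best first


attn = np.array([
    [0.25, 0.25, 0.25, 0.25],
    [0.25, 0.25, 0.25, 0.25],
    [0.40, 0.10, 0.25, 0.25],
    [0.40, 0.10, 0.25, 0.25],
])
visual_idx = np.array([0, 1])
text_idx = np.array([2, 3])
new_attn = reweight_attention(attn, visual_idx, text_idx, alpha=1.0)
order = rerank_beams([-1.0, -1.2], [0.1, 0.9], lam=0.5)
```

In this toy setup the first visual token receives more text attention than the second, so it is boosted more strongly. This is the selective behaviour the abstract contrasts with uniformly amplifying all visual tokens. The beam with the slightly lower log-probability but higher visual attention is ranked first once the bonus is applied.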

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
