CV · Mar 5

AdaIAT: Adaptively Increasing Attention to Generated Text to Alleviate Hallucinations in LVLM

arXiv:2603.04908v1 · 1 citation
Originality: Highly original
AI Analysis

This work provides a method to reduce hallucinations in LVLMs. Hallucination undermines the reliability and applicability of these models, particularly for users who depend on accurate visual descriptions.

This paper addresses hallucination in Large Vision-Language Models (LVLMs) by proposing AdaIAT, a method that adaptively increases attention to the generated text during inference. On LLaVA-1.5, it reduces the hallucination rates $C_S$ and $C_I$ by 35.8% and 37.1%, respectively, while maintaining linguistic coherence and prediction capability.

Hallucination has been a significant impediment to the development and application of current Large Vision-Language Models (LVLMs). To mitigate hallucinations, one intuitive and effective way is to directly increase attention weights to image tokens during inference. Although this effectively reduces the hallucination rate, it often induces repetitive descriptions. To address this, we first conduct an analysis of attention patterns and reveal that real object tokens tend to assign higher attention to the generated text than hallucinated ones do. This inspires us to leverage the generated text, which contains instruction-related visual information and contextual knowledge, to alleviate hallucinations while maintaining linguistic coherence. We therefore propose Increasing Attention to Generated Text (IAT) and demonstrate that it significantly reduces the hallucination rate while avoiding repetitive descriptions. To prevent naive amplification from impairing the inherent prediction capabilities of LVLMs, we further explore Adaptive IAT (AdaIAT), which employs a layer-wise threshold to control when to intervene and a fine-grained amplification magnitude tailored to the characteristics of each attention head. Both analysis and experiments demonstrate the effectiveness of AdaIAT. Results on several LVLMs show that AdaIAT effectively alleviates hallucination (reducing hallucination rates $C_S$ and $C_I$ on LLaVA-1.5 by 35.8% and 37.1%, respectively) while preserving linguistic performance and prediction capability, achieving an attractive trade-off.
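
The abstract gives no formulas, so the following is only a rough sketch of the general idea of amplifying attention to generated-text positions with a threshold-gated intervention. It is not the authors' implementation. The function name, the `alpha` and `threshold` parameters, the gating rule (intervene only when the attention mass on generated text falls below the threshold), and the renormalization step are all assumptions made for illustration.

```python
def amplify_generated_text_attention(attn, generated_idx, alpha=1.5, threshold=0.2):
    """Illustrative sketch only (hypothetical rule, not the paper's formula).

    attn          : post-softmax attention weights of one head for the
                    current query token (non-negative, summing to 1).
    generated_idx : positions in `attn` that hold previously generated text.
    alpha         : assumed per-head amplification factor (> 1 amplifies).
    threshold     : assumed per-layer gate; intervene only if the attention
                    mass on generated text is below this value.
    """
    generated = set(generated_idx)
    mass_on_generated = sum(attn[i] for i in generated)

    # Assumed gate: leave the head untouched when it already attends
    # sufficiently to the generated text.
    if mass_on_generated >= threshold:
        return list(attn)

    # Scale the weights on generated-text positions, then renormalize so
    # the result is still a probability distribution.
    scaled = [w * alpha if i in generated else w for i, w in enumerate(attn)]
    total = sum(scaled)
    return [w / total for w in scaled]
```

In this sketch, `alpha` stands in for the per-head amplification magnitude and `threshold` for the layer-wise control the abstract mentions. How AdaIAT actually derives either quantity is not stated in the text above.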

