CV · AI · May 23

Correcting Visual Blur Induced by Attention Distraction to Reduce Hallucinations: Algorithm and Theory

arXiv:2605.24602 · 52.9
AI Analysis

For researchers and practitioners using multimodal LLMs, this work provides a theoretically grounded, training-free solution to reduce object hallucinations, a critical bottleneck in reliable visual understanding.

The paper links object hallucinations in multimodal large language models to an attention distraction phenomenon analogous to divided focus in humans. It proposes AFIP, a training-free method that corrects this distraction through cross-head attention enrichment and dynamic historical attention enhancement, and reports reduced hallucination across multiple benchmarks and models.

Multimodal large language models (MLLMs) frequently suffer from object hallucinations, yet the visual perceptual mechanism underlying this failure remains poorly understood. In this work, we reveal that hallucinations are strongly associated with a human-like attention distraction phenomenon, where humans under divided focus experience degraded visual clarity and produce inaccurate descriptions, while in models the same mechanism manifests as spatial inconsistency in multi-head attention and temporal fading of attention to image tokens during decoding. We further provide theoretical insights that attention dispersion increases model complexity and degrades classification generalization. Motivated by these findings, we propose an Attention-Focused Approach for Improved Image Perception (AFIP), which corrects attention distraction via cross-head attention enrichment and reinforces visual grounding through dynamic historical attention enhancement. Extensive experiments on multiple benchmarks and models validate the effectiveness of AFIP without additional training.
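
The two symptoms named in the abstract, temporal fading of attention to image tokens and spatial inconsistency across heads, can be made concrete with a small diagnostic. The sketch below is illustrative only and is not the paper's method: the function names, the `(steps, heads, tokens)` array layout, and the choice of cosine distance as an inconsistency measure are all assumptions made for this example.

```python
import numpy as np

def image_attention_mass(attn, image_idx):
    """Per-step attention mass on image tokens, averaged over heads.

    attn: array of shape (steps, heads, tokens); each row sums to 1.
    image_idx: indices of image tokens (hypothetical layout).
    A decreasing curve over steps would indicate temporal fading.
    """
    return attn[:, :, image_idx].sum(axis=-1).mean(axis=1)

def cross_head_inconsistency(attn, image_idx, eps=1e-12):
    """Per-step mean pairwise cosine distance between heads' image maps.

    Assumed proxy for spatial inconsistency: 0 when all heads look at
    the same image tokens, 1 when their image maps are orthogonal.
    Requires at least two heads.
    """
    maps = attn[:, :, image_idx]                      # (steps, heads, n_img)
    norms = np.linalg.norm(maps, axis=-1, keepdims=True)
    unit = maps / np.maximum(norms, eps)              # unit-length maps
    sim = unit @ unit.transpose(0, 2, 1)              # (steps, heads, heads)
    h = sim.shape[-1]
    # Sum of off-diagonal similarities, i.e. all ordered head pairs.
    off_diag = sim.sum(axis=(1, 2)) - np.trace(sim, axis1=1, axis2=2)
    return 1.0 - off_diag / (h * (h - 1))

# Toy example: 2 decoding steps, 2 heads, 4 tokens (first 2 are image tokens).
attn = np.array([
    [[0.5, 0.3, 0.1, 0.1], [0.5, 0.3, 0.1, 0.1]],   # heads agree, image-heavy
    [[0.4, 0.0, 0.3, 0.3], [0.0, 0.4, 0.3, 0.3]],   # heads disagree, image-light
])
mass = image_attention_mass(attn, [0, 1])
inconsistency = cross_head_inconsistency(attn, [0, 1])
```

In the toy input, the second step puts less mass on the image tokens and its two heads attend to disjoint image tokens, which is the pattern the abstract associates with hallucination. On a real model, `attn` would have to be collected from the decoder's attention weights at each generation step.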
