CV, Jun 30, 2025

CAI: Caption-Sensitive Attention Intervention for Mitigating Object Hallucination in Large Vision-Language Models

arXiv:2506.23590v1, 4 citations, h-index: 28
Originality: Incremental advance
AI Analysis

This work addresses object hallucination, where LVLMs generate content that does not match the image, which matters to users who rely on these models. It is an incremental improvement over existing methods.

The paper tackles object hallucination in Large Vision-Language Models (LVLMs) by proposing CAI, a training-free method based on caption-sensitive attention intervention. It reports state-of-the-art hallucination mitigation across four benchmarks with minimal additional inference cost.

Although Large Vision-Language Models (LVLMs) have demonstrated powerful capabilities in interpreting visual information, they frequently produce content that deviates from the visual input, leading to object hallucination. To tackle this, recent works mostly depend on expensive manual annotation and training, or significantly increase inference time. In this work, we observe that LVLMs' attention to visual information is significantly stronger when answering caption queries than when answering non-caption queries. Inspired by this phenomenon, we propose Caption-sensitive Attention Intervention (CAI), a training-free, plug-and-play hallucination mitigation method that leverages the attention activation pattern in response to caption queries to enhance LVLMs' visual perception capability. Extensive experimental results across four benchmarks, covering both discriminative and generative tasks, demonstrate that CAI achieves state-of-the-art (SOTA) hallucination mitigation performance with only minimal additional inference cost.
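The abstract does not spell out how the intervention is computed, so the following is only a rough sketch of one generic way an activation-based intervention of this kind could work, not the paper's actual algorithm. It assumes that per-head activations can be recorded for caption and non-caption queries, and that the difference of their means is added to a head's output at inference time. The function names (`caption_shift`, `intervene`), the `strength` parameter, and the toy numbers are all hypothetical.

```python
from statistics import fmean


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [fmean(column) for column in zip(*vectors)]


def caption_shift(caption_acts, non_caption_acts):
    """Hypothetical per-head shift: mean activation under caption queries
    minus mean activation under non-caption queries (an assumption, not
    taken from the paper)."""
    cap_mean = mean_vector(caption_acts)
    non_mean = mean_vector(non_caption_acts)
    return [c - n for c, n in zip(cap_mean, non_mean)]


def intervene(head_output, shift, strength=1.0):
    """Add the scaled shift to one head's output at inference time."""
    return [h + strength * s for h, s in zip(head_output, shift)]


# Toy activations for a single attention head (made-up numbers).
caption_acts = [[1.0, 2.0], [3.0, 4.0]]
non_caption_acts = [[0.0, 1.0], [1.0, 1.0]]

shift = caption_shift(caption_acts, non_caption_acts)
steered = intervene([0.5, 0.5], shift, strength=0.5)
print(shift, steered)
```

Because the shift would be computed once and then reused, a scheme like this needs no retraining and adds only a vector addition per intervened head, which is consistent with the training-free, low-overhead framing in the abstract.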

