CV | AI | Mar 25

Revealing Multi-View Hallucination in Large Vision-Language Models

arXiv:2603.23934 | 82.8 | h-index: 45
AI Analysis

This work addresses a critical issue for users of multi-view image applications, such as robotics or surveillance, by improving model reliability. The contribution is incremental, however, because it builds on existing hallucination mitigation approaches.

The paper tackles the problem of multi-view hallucination in large vision-language models, where models confuse visual information from different instances or viewpoints, and proposes a training-free decoding technique that improves performance by up to 34.6 points over existing methods.

Large vision-language models (LVLMs) are increasingly being applied to multi-view image inputs captured from diverse viewpoints. However, despite this growing use, current LVLMs often confuse or mismatch visual information originating from different instances or viewpoints, a phenomenon we term multi-view hallucination. To systematically analyze this problem, we construct MVH-Bench, a benchmark comprising 4.8k question-answer pairs targeting two types of hallucination: cross-instance and cross-view. Empirical results show that recent LVLMs struggle to correctly associate visual evidence with its corresponding instance or viewpoint. To overcome this limitation, we propose Reference Shift Contrastive Decoding (RSCD), a training-free decoding technique that suppresses visual interference by generating negative logits through attention masking. Experiments on MVH-Bench with Qwen2.5-VL and LLaVA-OneVision demonstrate that RSCD consistently improves performance over existing hallucination mitigation methods, by up to 21.1 and 34.6 points respectively, highlighting the effectiveness of our approach.
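
The abstract describes RSCD only at a high level, so the following is a minimal sketch of the general contrastive-decoding idea it belongs to, not the authors' implementation. It assumes two next-token logit vectors are already available: one from the normal forward pass and one "negative" vector from a second pass in which attention has been masked. The combination rule `(1 + alpha) * positive - alpha * negative` is the generic form used in contrastive decoding and is an assumption here, as are the function names, the `alpha` parameter, and the toy numbers.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def contrastive_logits(positive, negative, alpha=1.0):
    """Generic contrastive combination (assumed form, not confirmed as RSCD's).

    positive: logits from the normal forward pass.
    negative: logits from a pass with attention masked (hypothetical input).
    alpha:    contrast strength; alpha=0 recovers the positive distribution.
    """
    if len(positive) != len(negative):
        raise ValueError("logit vectors must have the same length")
    pos = log_softmax(positive)
    neg = log_softmax(negative)
    return [(1.0 + alpha) * p - alpha * n for p, n in zip(pos, neg)]


def greedy_token(scores):
    """Index of the highest-scoring token."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy three-token vocabulary (illustrative numbers only).
# Token 1 stands for an answer driven by interference from the wrong view.
positive = [1.9, 2.0, 0.1]  # normal pass: narrowly prefers token 1
negative = [0.5, 2.5, 0.1]  # masked pass: strongly prefers token 1

plain_choice = greedy_token(positive)
contrast_choice = greedy_token(contrastive_logits(positive, negative))
print(plain_choice, contrast_choice)
```

In this toy case the masked pass favors token 1 even more strongly than the normal pass does, so subtracting it penalizes token 1 and the greedy choice moves to token 0. How RSCD builds its attention mask and chooses the contrast strength is not stated in the abstract.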
