Diving into Mitigating Hallucinations from a Vision Perspective for Large Vision-Language Models
This work addresses a critical reliability issue for real-world applications of LVLMs, though its contribution is incremental: it builds on existing encoder analysis and feature fusion methods.
The paper tackles object hallucination in Large Vision-Language Models (LVLMs) by analyzing how different visual encoders produce distinct hallucination patterns. It introduces VHBench-10, a new benchmark of approximately 10,000 samples spanning ten fine-grained hallucination categories, and VisionWeaver, a Context-Aware Routing Network that the authors report reduces hallucinations and improves overall performance.
Object hallucination in Large Vision-Language Models (LVLMs) significantly impedes their real-world applicability. As the primary component for accurately interpreting visual information, the choice of visual encoder is pivotal. We hypothesize that the diverse training paradigms employed by different visual encoders instill them with distinct inductive biases, which leads to their diverse hallucination performances. Existing benchmarks typically focus on coarse-grained hallucination detection and fail to capture the diverse hallucinations elaborated in our hypothesis. To systematically analyze these effects, we introduce VHBench-10, a comprehensive benchmark with approximately 10,000 samples for evaluating LVLMs across ten fine-grained hallucination categories. Our evaluations confirm encoders exhibit unique hallucination characteristics. Building on these insights and the suboptimality of simple feature fusion, we propose VisionWeaver, a novel Context-Aware Routing Network. It employs global visual features to generate routing signals, dynamically aggregating visual features from multiple specialized experts. Comprehensive experiments confirm the effectiveness of VisionWeaver in significantly reducing hallucinations and improving overall model performance.
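The routing idea described above (global visual features produce routing signals that weight the features of several expert encoders) can be illustrated with a minimal sketch. This is not the authors' implementation. The linear router, the softmax normalization, the plain weighted sum, and every name below (`routing_weights`, `fuse_experts`, `router`, the toy feature values) are assumptions made purely for illustration; the abstract does not specify these details.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]  # rows are patch tokens (features) or per-expert router rows


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def routing_weights(global_feature: Vector, router: Matrix) -> Vector:
    """Assumed router: one logit per expert (dot product of a router row
    with the global visual feature), normalized with softmax."""
    logits = [sum(w * g for w, g in zip(row, global_feature)) for row in router]
    return softmax(logits)


def fuse_experts(expert_features: List[Matrix], weights: Vector) -> Matrix:
    """Assumed aggregation: weighted sum of per-expert patch features.
    All experts are assumed to share the same (n_patches, dim) shape."""
    n_patches = len(expert_features[0])
    dim = len(expert_features[0][0])
    fused = [[0.0] * dim for _ in range(n_patches)]
    for feats, w in zip(expert_features, weights):
        for p in range(n_patches):
            for d in range(dim):
                fused[p][d] += w * feats[p][d]
    return fused


# Toy inputs (hypothetical values): two experts, two patches, two dimensions.
global_feature = [1.0, 0.0]
router = [[2.0, 0.0], [0.0, 2.0]]
expert_a = [[1.0, 1.0], [1.0, 1.0]]
expert_b = [[3.0, 3.0], [3.0, 3.0]]

weights = routing_weights(global_feature, router)
fused = fuse_experts([expert_a, expert_b], weights)
```

In this toy setup the global feature aligns with the first router row, so the first expert receives the larger weight and the fused features sit closer to `expert_a` than to `expert_b`. A fixed, input-independent fusion would instead apply the same weights to every image, which is the kind of simple feature fusion the abstract describes as suboptimal.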