CV · Nov 6, 2024

H-POPE: Hierarchical Polling-based Probing Evaluation of Hallucinations in Large Vision-Language Models

arXiv:2411.04077v1 · 8 citations · h-index: 1
Originality: Synthesis-oriented
AI Analysis

This work addresses unreliable outputs in multi-modal AI systems, which matters to users who depend on accurate visual-textual integration. Its contribution is incremental, however, because it evaluates hallucinations rather than mitigating them.

The paper tackles hallucinations in large vision-language models by proposing H-POPE, a benchmark that assesses inconsistencies in both object existence and object attributes. It finds that models are prone to hallucinations, especially on fine-grained attributes.

By leveraging both texts and images, large vision language models (LVLMs) have shown significant progress in various multi-modal tasks. Nevertheless, these models often suffer from hallucinations, e.g., they exhibit inconsistencies between the visual input and the textual output. To address this, we propose H-POPE, a coarse-to-fine-grained benchmark that systematically assesses hallucination in object existence and attributes. Our evaluation shows that models are prone to hallucinations on object existence, and even more so on fine-grained attributes. We further investigate whether these models rely on visual input to formulate the output texts.
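The coarse-to-fine idea can be illustrated with a minimal sketch. It assumes that "polling-based" means asking the model yes/no questions, first about whether an object exists (coarse) and then about its attributes (fine), and scoring the answers per level. The `Probe` class, the question templates, the example object and attributes, and the `score` helper are all hypothetical names invented for illustration; they are not taken from the paper or its code.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    question: str  # yes/no question posed to the model
    label: bool    # ground truth: True means the correct answer is "yes"
    level: str     # "existence" (coarse) or "attribute" (fine)


def build_probes(obj, present, attributes):
    """Build coarse-to-fine probes for one object (hypothetical templates).

    attributes maps an attribute word to whether it truly holds for the object.
    """
    probes = [Probe(f"Is there a {obj} in the image?", present, "existence")]
    if present:
        # Attribute questions only make sense for objects that are in the image.
        for attr, holds in attributes.items():
            probes.append(Probe(f"Is the {obj} in the image {attr}?", holds, "attribute"))
    return probes


def score(probes, answers):
    """Return accuracy and yes-ratio per level, given boolean model answers."""
    by_level = {}
    for probe, answer in zip(probes, answers):
        stats = by_level.setdefault(probe.level, {"n": 0, "correct": 0, "yes": 0})
        stats["n"] += 1
        stats["correct"] += int(answer == probe.label)
        stats["yes"] += int(answer)
    return {
        level: {"accuracy": s["correct"] / s["n"], "yes_ratio": s["yes"] / s["n"]}
        for level, s in by_level.items()
    }


# A stand-in "model" that answers yes to everything, for illustration only.
probes = build_probes("dog", True, {"brown": True, "wet": False})
answers = [True for _ in probes]
results = score(probes, answers)
```

Reporting the two levels separately is what would let a finding like "more hallucination on fine-grained attributes" show up as a gap between the existence and attribute scores. A high yes-ratio alongside low accuracy would suggest the model is agreeing with the question rather than checking the image.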

