CV · Nov 6, 2024

H-POPE: Hierarchical Polling-based Probing Evaluation of Hallucinations in Large Vision-Language Models

arXiv:2411.04077v1 · 11.38 citations · h-index: 1

Originality: Synthesis-oriented

AI Analysis

This work addresses unreliable outputs in multi-modal AI systems, which matters to users who depend on accurate visual-textual integration. Its contribution is incremental, however, because it evaluates hallucinations rather than mitigating them.

The paper tackles the problem of hallucinations in large vision-language models by proposing H-POPE, a benchmark for assessing object existence and attribute inconsistencies, finding that models are prone to hallucinations, especially on fine-grained attributes.

By leveraging both text and images, large vision-language models (LVLMs) have shown significant progress in various multi-modal tasks. Nevertheless, these models often suffer from hallucinations, i.e., inconsistencies between the visual input and the textual output. To address this, we propose H-POPE, a coarse-to-fine-grained benchmark that systematically assesses hallucination in object existence and attributes. Our evaluation shows that models are prone to hallucinations on object existence, and even more so on fine-grained attributes. We further investigate whether these models rely on visual input to formulate the output texts.
