CV AI · Jun 24, 2024

Measuring the Measurers: Quality Evaluation of Hallucination Benchmarks for Large Vision-Language Models

Bei Yan, Jie Zhang, Zheng Yuan, Shiguang Shan, Xilin Chen

arXiv:2406.17115v312.113 citations · Has Code

Originality: Incremental advance

AI Analysis

The paper addresses a critical gap in how hallucination in LVLMs is evaluated, which matters for improving model safety and reliability in real-world applications. The advance is incremental, since it builds on existing benchmark efforts.

The paper tackles the problem of inconsistent and unreliable benchmarks for evaluating hallucination in Large Vision-Language Models. It proposes a quality measurement framework (HQM) and a new benchmark (HQH); evaluation on HQH reveals severe hallucination in popular models, both in their main answers and in their additional analysis.

Despite the outstanding performance in multimodal tasks, Large Vision-Language Models (LVLMs) have been plagued by the issue of hallucination, i.e., generating content that is inconsistent with the corresponding visual inputs. While previous works have proposed various benchmarks to evaluate this issue, the quality of these evaluations remains unverified. We observe that some of these benchmarks may produce inconsistent evaluation results across repeated tests or fail to align with human evaluation. To address this, we propose a Hallucination benchmark Quality Measurement framework (HQM), which leverages specific indicators to assess both reliability and validity. Our empirical analysis using HQM reveals and pinpoints potential evaluation issues in existing benchmarks, exposing a critical gap in current hallucination evaluation. To bridge this gap, we propose HQH, a High-Quality Hallucination benchmark, which demonstrates superior reliability and validity under HQM, serving as a credible evaluation tool. Our large-scale evaluation of popular LVLMs on HQH reveals severe hallucination problems, which occur not only in the models' main answer to a question but also in additional analysis. This highlights the necessity for future model improvements to effectively mitigate hallucinations and reduce the associated security risks in real-world applications. Our benchmark is publicly available at https://github.com/HQHBench/HQHBench.
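The abstract does not say which indicators HQM uses, so the following is only a minimal sketch of the two ideas it names: reliability as agreement between repeated runs of the same benchmark, and validity as agreement between benchmark scores and human judgments. Pearson correlation is an assumed choice of statistic, the function names are hypothetical, and the scores are made-up illustrative values, not results from the paper.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def repeated_run_reliability(run_a, run_b):
    """Hypothetical reliability indicator: agreement between two repeated
    runs of the same benchmark on the same models."""
    return pearson(run_a, run_b)


def human_alignment_validity(benchmark_scores, human_scores):
    """Hypothetical validity indicator: agreement between benchmark scores
    and human evaluation of the same models."""
    return pearson(benchmark_scores, human_scores)


# Made-up per-model hallucination rates (one entry per model), for illustration only.
run_1 = [0.42, 0.35, 0.58, 0.21, 0.47]
run_2 = [0.40, 0.37, 0.55, 0.23, 0.49]
human = [0.45, 0.30, 0.60, 0.20, 0.50]

print("reliability:", round(repeated_run_reliability(run_1, run_2), 3))
print("validity:", round(human_alignment_validity(run_1, human), 3))
```

Under this reading, a benchmark whose repeated runs correlate weakly would count as unreliable, and one whose scores correlate weakly with human ratings would count as lacking validity. A rank correlation could be substituted if only the ordering of models matters.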
