CV | AI | Oct 17, 2024

Trust but Verify: Programmatic VLM Evaluation in the Wild

arXiv:2410.13121v1 | 2 citations | h-index: 17
AI Analysis

This paper addresses the challenge of reliably quantifying VLM hallucinations, which matters to both researchers and practitioners. The contribution is incremental, however, since it builds on existing scene-graph and LLM methods.

The authors address the problem of evaluating hallucinations in Vision-Language Model (VLM) responses to open-ended queries by proposing PROVE, a programmatic benchmarking paradigm. On a benchmark of 10.5k QA pairs, they find that few VLMs achieve a good balance between helpfulness and truthfulness.

Vision-Language Models (VLMs) often generate plausible but incorrect responses to visual queries. However, reliably quantifying the effect of such hallucinations in free-form responses to open-ended queries is challenging, as it requires visually verifying each claim within the response. We propose Programmatic VLM Evaluation (PROVE), a new benchmarking paradigm for evaluating VLM responses to open-ended queries. To construct PROVE, we provide a large language model (LLM) with a high-fidelity scene-graph representation constructed from a hyper-detailed image caption, and prompt it to generate diverse question-answer (QA) pairs, as well as programs that can be executed over the scene graph object to verify each QA pair. We thus construct a benchmark of 10.5k challenging but visually grounded QA pairs. Next, to evaluate free-form model responses to queries in PROVE, we propose a programmatic evaluation strategy that measures both the helpfulness and truthfulness of a response within a unified scene graph-based framework. We benchmark the helpfulness-truthfulness trade-offs of a range of VLMs on PROVE, finding that very few are in fact able to achieve a good balance between the two. Project page: https://prove-explorer.netlify.app/
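
To make the idea of verifying QA pairs with programs executed over a scene graph more concrete, here is a minimal Python sketch. It is not the PROVE implementation. The `SceneGraph` class, the triple format, `verify_qa`, and the two scoring functions are all hypothetical names introduced for illustration. The scores use exact set overlap (recall against a reference answer for helpfulness, precision against the scene graph for truthfulness), which is only one plausible simplification of what the abstract describes.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical triple format: (subject, relation_or_attribute, value).
Triple = tuple[str, str, str]


@dataclass
class SceneGraph:
    """Toy scene graph stored as a set of triples (an assumption, not PROVE's format)."""
    triples: set[Triple] = field(default_factory=set)

    def query(self, subject: str, relation: str) -> set[str]:
        """Return every value linked to `subject` through `relation`."""
        return {v for s, r, v in self.triples if s == subject and r == relation}


def verify_qa(graph: SceneGraph,
              program: Callable[[SceneGraph], Optional[str]],
              answer: str) -> bool:
    """Execute a verification program over the graph and compare it to the claimed answer."""
    return program(graph) == answer


def helpfulness(response: set[Triple], reference: set[Triple]) -> float:
    """Simplified recall: share of reference-answer triples the response covers."""
    if not reference:
        return 1.0
    return len(response & reference) / len(reference)


def truthfulness(response: set[Triple], graph: SceneGraph) -> float:
    """Simplified precision: share of response triples supported by the scene graph."""
    if not response:
        return 1.0
    return len(response & graph.triples) / len(response)


# Toy example: one image, one QA pair, one model response.
graph = SceneGraph({
    ("dog", "color", "brown"),
    ("dog", "on", "sofa"),
    ("sofa", "color", "red"),
})

# Question: "What is the dog sitting on?"  Claimed answer: "sofa".
def where_is_dog(g: SceneGraph) -> Optional[str]:
    return next(iter(g.query("dog", "on")), None)

qa_is_grounded = verify_qa(graph, where_is_dog, "sofa")

reference = {("dog", "on", "sofa")}
response = {("dog", "on", "sofa"), ("dog", "color", "black")}  # second claim is hallucinated

h = helpfulness(response, reference)
t = truthfulness(response, graph)
```

In this toy case the response fully answers the question but adds an unsupported claim, so it scores high on helpfulness and lower on truthfulness. That is the kind of trade-off the benchmark is designed to expose. A realistic evaluator would need soft matching between triples rather than exact equality.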

