CV · LG · Jul 17, 2025

COREVQA: A Crowd Observation and Reasoning Entailment Visual Question Answering Benchmark

arXiv:2507.13405v1 · 2 citations · h-index: 3
Originality: Synthesis-oriented
AI Analysis

This paper addresses a gap in evaluating VLMs' ability to reason over crowded scenes, but its contribution is incremental: it introduces a new benchmark rather than a novel method.

The paper tackles the lack of benchmarks for visual entailment reasoning in Vision-Language Models (VLMs) by proposing COREVQA, a dataset of 5,608 pairs, each matching a crowded image with a true/false statement. Even the top-performing VLMs score below 80% accuracy, and the weakest model scores 39.98%.

Recently, many benchmarks and datasets have been developed to evaluate Vision-Language Models (VLMs) using visual question answering (VQA) pairs, and models have shown significant accuracy improvements. However, these benchmarks rarely test the model's ability to accurately complete visual entailment, for instance, accepting or refuting a hypothesis based on the image. To address this, we propose COREVQA (Crowd Observations and Reasoning Entailment), a benchmark of 5608 image and synthetically generated true/false statement pairs, with images derived from the CrowdHuman dataset, to provoke visual entailment reasoning on challenging crowded images. Our results show that even the top-performing VLMs achieve accuracy below 80%, with other models performing substantially worse (39.98%-69.95%). This significant performance gap reveals key limitations in VLMs' ability to reason over certain types of image-question pairs in crowded scenes.

