CV | AI | Sep 25, 2025

CompareBench: A Benchmark for Visual Comparison Reasoning in Vision-Language Models

arXiv:2509.22737v1 | 5 citations | h-index: 4 | Has Code
Originality: Incremental advance
AI Analysis

CompareBench addresses a fundamental yet understudied skill in vision-language models and gives researchers a diagnostic benchmark, though the advance is incremental because it builds on existing evaluation frameworks.

The authors introduce CompareBench, a benchmark of 1000 QA pairs across four tasks, to evaluate visual comparison reasoning in vision-language models. They find that even the strongest models consistently fail at temporal ordering and spatial relations and often make mistakes in basic counting and geometric comparisons.

We introduce CompareBench, a benchmark for evaluating visual comparison reasoning in vision-language models (VLMs), a fundamental yet understudied skill. CompareBench consists of 1000 QA pairs across four tasks: quantity (600), temporal (100), geometric (200), and spatial (100). It is derived from two auxiliary datasets that we constructed: TallyBench (2000 counting images with QA) and HistCaps (515 historical images with bilingual captions). We evaluate both closed-source APIs (OpenAI, Gemini, Claude) and open-source models (Qwen2.5-VL and Qwen3-VL series). Results show clear scaling trends but also reveal critical limitations: even the strongest models consistently fail at temporal ordering and spatial relations, and they often make mistakes in basic counting and geometric comparisons that are trivial for humans. These findings demonstrate that visual comparison remains a systematic blind spot for current VLMs. By providing controlled, diverse, and diagnostic evaluation, CompareBench establishes a foundation for advancing more reliable multimodal reasoning.
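Because the quantity task accounts for 600 of the 1000 QA pairs, a single overall accuracy figure is dominated by counting questions, which is one reason per-task reporting matters for a benchmark with this composition. The sketch below illustrates that arithmetic. It is not the authors' evaluation code: the record format `(task, predicted, gold)` and the function names `per_task_accuracy` and `overall_accuracy` are assumptions made for illustration; only the task sizes come from the abstract.

```python
from collections import defaultdict

# Task sizes as stated in the abstract (sum to 1000 QA pairs).
TASK_SIZES = {"quantity": 600, "temporal": 100, "geometric": 200, "spatial": 100}


def per_task_accuracy(records):
    """Compute accuracy per task.

    `records` is an iterable of (task, predicted, gold) tuples.
    This record format is hypothetical, not the benchmark's actual schema.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, predicted, gold in records:
        total[task] += 1
        correct[task] += int(predicted == gold)
    return {task: correct[task] / total[task] for task in total}


def overall_accuracy(task_acc, sizes=TASK_SIZES):
    """Micro-average: weight each task's accuracy by its number of QA pairs."""
    n = sum(sizes[task] for task in task_acc)
    return sum(task_acc[task] * sizes[task] for task in task_acc) / n
```

With these weights, a model that scores well on quantity but poorly on the temporal and spatial tasks can still post a high overall number, so the per-task dictionary is the more diagnostic output.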

