CV · Sep 17, 2025

Can Current AI Models Count What We Mean, Not What They See? A Benchmark and Systematic Evaluation

Gia Khanh Nguyen, Yifeng Huang, Minh Hoai

arXiv:2509.13939v1 · 13.14 citations · h-index: 2 · DICTA

Originality: Synthesis-oriented

AI Analysis

This paper addresses the challenge of intent-driven counting in complex scenes, a problem relevant to AI and computer vision researchers. Its contribution is incremental, however, because it focuses on benchmarking rather than proposing a new method.

The paper tackles fine-grained visual counting by introducing PairTally, a benchmark dataset of 681 images in which models must count objects distinguished by subtle differences, and finds that current models struggle in these cases.

Visual counting is a fundamental yet challenging task, especially when users need to count objects of a specific type in complex scenes. While recent models, including class-agnostic counting models and large vision-language models (VLMs), show promise in counting tasks, their ability to perform fine-grained, intent-driven counting remains unclear. In this paper, we introduce PairTally, a benchmark dataset specifically designed to evaluate fine-grained visual counting. Each of the 681 high-resolution images in PairTally contains two object categories, requiring models to distinguish and count based on subtle differences in shape, size, color, or semantics. The dataset includes both inter-category (distinct categories) and intra-category (closely related subcategories) settings, making it suitable for rigorous evaluation of selective counting capabilities. We benchmark a variety of state-of-the-art models, including exemplar-based methods, language-prompted models, and large VLMs. Our results show that despite recent advances, current models struggle to reliably count what users intend, especially in fine-grained and visually ambiguous cases. PairTally provides a new foundation for diagnosing and improving fine-grained visual counting systems.
