CV · Nov 21, 2025

Can Vision-Language Models Count? A Synthetic Benchmark and Analysis of Attention-Based Interventions

arXiv:2511.17722v2 · 4 citations · Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses counting accuracy in VLMs and is aimed at AI researchers. It is incremental, building on prior research with synthetic benchmarks.

The study tackles the difficulty Vision-Language Models (VLMs) have with counting tasks, which it links to learned biases and varying input complexity, and finds that attention-based interventions yield modest performance gains.

Recent research suggests that Vision-Language Models (VLMs) often rely on inherent biases learned during training when responding to queries about visual properties of images. These biases are exacerbated when VLMs are asked highly specific questions that require them to focus on particular areas of the image, as in counting tasks. We build upon this research by developing a synthetic benchmark dataset and evaluation framework to systematically determine how counting performance varies as image and prompt properties change. Using open-source VLMs, we then analyze how attention allocation fluctuates with varying input parameters (e.g., number of objects in the image, object color, background color, object texture, background texture, and prompt specificity). We further implement attention-based interventions that modulate focus on visual tokens at different layers, and we evaluate their impact on counting performance across a range of visual conditions. Our experiments reveal that while counting remains challenging for VLMs, especially under high visual or linguistic complexity, certain attention interventions can lead to modest gains in counting performance.
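
The abstract does not say how the attention-based interventions are implemented. As an illustration only, one common way to "modulate focus on visual tokens" is to rescale the attention weights a query assigns to visual-token positions and then renormalize. The sketch below assumes a single attention row that already sums to 1, a boolean mask marking visual tokens, and a scaling factor `alpha`. The function name and all parameters are hypothetical and are not taken from the paper or its code.

```python
def boost_visual_attention(attn_row, visual_mask, alpha):
    """Rescale attention on visual tokens by `alpha`, then renormalize.

    Hypothetical illustration: not the paper's actual intervention.
    attn_row    -- list of non-negative attention weights (one query row)
    visual_mask -- list of bools, True where the key is a visual token
    alpha       -- factor > 1 boosts visual tokens, < 1 suppresses them
    """
    if len(attn_row) != len(visual_mask):
        raise ValueError("attn_row and visual_mask must have equal length")
    if alpha < 0:
        raise ValueError("alpha must be non-negative")

    # Scale only the visual-token positions.
    scaled = [w * alpha if is_visual else w
              for w, is_visual in zip(attn_row, visual_mask)]

    total = sum(scaled)
    if total == 0:
        raise ValueError("attention row sums to zero after scaling")

    # Renormalize so the row is again a probability distribution.
    return [w / total for w in scaled]


# Toy example: the first two keys are visual tokens, the last two are text.
row = [0.1, 0.2, 0.3, 0.4]
mask = [True, True, False, False]
boosted = boost_visual_attention(row, mask, alpha=2.0)
visual_mass_before = sum(w for w, m in zip(row, mask) if m)
visual_mass_after = sum(w for w, m in zip(boosted, mask) if m)
```

In a real model, a transformation like this would be applied per head and per layer inside the attention computation. Varying the layer at which it is applied corresponds to the "different layers" the abstract mentions.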

