CV · Feb 28, 2025

T2ICount: Enhancing Cross-modal Understanding for Zero-Shot Counting

arXiv:2502.20625v3 · 11 citations · h-index: 6 · Has Code · CVPR
Originality: Incremental advance
AI Analysis

This work addresses the challenge of limited text sensitivity in zero-shot counting for computer vision applications, representing an incremental improvement over existing methods.

The paper tackles the problem of zero-shot object counting by introducing T2ICount, a diffusion-based framework that improves text sensitivity through a Hierarchical Semantic Correction Module and Representational Regional Coherence Loss, achieving superior performance across benchmarks.

Zero-shot object counting aims to count instances of arbitrary object categories specified by text descriptions. Existing methods typically rely on vision-language models like CLIP, but often exhibit limited sensitivity to text prompts. We present T2ICount, a diffusion-based framework that leverages rich prior knowledge and fine-grained visual understanding from pretrained diffusion models. While one-step denoising ensures efficiency, it leads to weakened text sensitivity. To address this challenge, we propose a Hierarchical Semantic Correction Module that progressively refines text-image feature alignment, and a Representational Regional Coherence Loss that provides reliable supervision signals by leveraging the cross-attention maps extracted from the denoising U-Net. Furthermore, we observe that current benchmarks mainly focus on majority objects in images, potentially masking models' text sensitivity. To address this, we contribute a challenging re-annotated subset of FSC147 for better evaluation of text-guided counting ability. Extensive experiments demonstrate that our method achieves superior performance across different benchmarks. Code is available at https://github.com/cha15yq/T2ICount.
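
The abstract mentions cross-attention maps extracted from the denoising U-Net. As a rough illustration of what such a map is, the sketch below computes generic scaled dot-product cross-attention between per-pixel image features (queries) and text-token features (keys). It is not the authors' implementation: the function names `softmax` and `cross_attention_maps`, the toy feature values, and the omission of learned query/key projections are all assumptions made for illustration, and plain Python lists are used to keep it dependency-free.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_maps(image_feats, text_feats):
    """Generic scaled dot-product cross-attention (illustrative only).

    image_feats: N pixel feature vectors of dimension d (used as queries).
    text_feats:  T token feature vectors of dimension d (used as keys).
    Returns T maps of length N; entry [t][n] is the attention weight
    that pixel n places on token t. Real U-Nets apply learned linear
    projections before this step; they are omitted here.
    """
    d = len(text_feats[0])
    scale = 1.0 / math.sqrt(d)
    per_pixel = []
    for q in image_feats:
        logits = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in text_feats]
        per_pixel.append(softmax(logits))  # distribution over tokens
    # Transpose so that each token gets its own spatial map.
    return [[row[t] for row in per_pixel] for t in range(len(text_feats))]


# Toy example: 4 pixels, 2 tokens, feature dimension 2 (hypothetical values).
image_feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
text_feats = [[4.0, 0.0], [0.0, 4.0]]
maps = cross_attention_maps(image_feats, text_feats)
```

In this toy setup the first two pixels attend mostly to the first token and the last two to the second, which is the kind of token-to-region signal a loss could use as spatial supervision.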

Code Implementations: 1 repo
