Just Say the Word: Annotation-Free Fine-Grained Object Counting
This work addresses the challenge of accurately counting visually similar objects in applications such as inventory management or surveillance. It offers an annotation-free solution that builds on existing counters rather than replacing them.
The paper tackles fine-grained object counting without requiring new annotations or real images. It uses a text-to-image diffusion model to generate synthetic data, then tunes a concept embedding that refines the overcounts produced by existing counters. It reports substantial improvements over baselines on a new benchmark of 1,037 images across 27 subcategories.
Fine-grained object counting remains a major challenge for class-agnostic counting models, which overcount visually similar but incorrect instances (e.g., jalapeño vs. poblano). Addressing this by annotating new data and fully retraining the model is time-consuming and does not guarantee generalization to additional novel categories at test time. Instead, we propose an alternative paradigm: given a category name, tune a compact concept embedding derived from the prompt using synthetic images and pseudo-labels generated by a text-to-image diffusion model. This embedding conditions a specialization module that refines raw overcounts from any frozen counter into accurate, category-specific estimates, without requiring real images or human annotations. We validate our approach on Lookalikes, a challenging new benchmark containing 1,037 images across 27 fine-grained subcategories, and show substantial improvements over strong baselines. Code will be released upon acceptance. Dataset: https://dalessandro.dev/datasets/lookalikes/
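To make the idea of tuning only a concept embedding on top of a frozen counter concrete, here is a minimal toy sketch in Python. It is not the paper's architecture, which the abstract does not specify. Everything in it is an assumption made for illustration: the two-dimensional features, the sigmoid gate standing in for the specialization module, the helper names (`make_sample`, `refined_count`, `tune_concept`), and the hand-built samples standing in for diffusion-generated images with pseudo-label counts.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate(feature, concept):
    # Hypothetical specialization step: score how well a location matches the concept.
    return sigmoid(sum(f * c for f, c in zip(feature, concept)))

def refined_count(density, features, concept):
    # Down-weight the frozen counter's density at locations that do not match.
    return sum(gate(f, concept) * d for d, f in zip(density, features))

def make_sample(n_target, n_distractor):
    # Stand-in for one synthetic image: the frozen counter fires once per object
    # (density 1.0), and each location carries a toy feature vector.
    # Targets look like [1, 0], lookalike distractors like [0, 1] (assumed).
    features = [[1.0, 0.0]] * n_target + [[0.0, 1.0]] * n_distractor
    density = [1.0] * len(features)
    return density, features, float(n_target)  # last item: pseudo-label count

def tune_concept(samples, dim=2, steps=500, lr=0.5):
    # Gradient descent on squared count error; only the embedding is updated,
    # the counter outputs (density, features) are never modified.
    concept = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for density, features, target in samples:
            err = refined_count(density, features, concept) - target
            for d, f in zip(density, features):
                g = gate(f, concept)
                for k in range(dim):
                    grad[k] += 2.0 * err * d * g * (1.0 - g) * f[k]
        concept = [c - lr * gk / len(samples) for c, gk in zip(concept, grad)]
    return concept

train = [make_sample(3, 2), make_sample(1, 4), make_sample(5, 1), make_sample(2, 2)]
concept = tune_concept(train)

# Held-out toy image: 4 target objects and 3 lookalikes.
density, features, true_count = make_sample(4, 3)
raw = sum(density)                                   # frozen counter overcounts
refined = refined_count(density, features, concept)  # gated estimate
print(f"raw={raw:.2f} refined={refined:.2f} true={true_count:.0f}")
```

In this toy setting the raw count includes the lookalikes, while the gated count moves toward the number of target objects. The point of the sketch is only the division of labor: the counter stays frozen and a small embedding is the sole trainable quantity.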