A Fixed-Point Approach to Unified Prompt-Based Counting
Aimed at researchers and practitioners in computer vision, this work addresses a key limitation of single-prompt counting models, offering an incremental improvement through a fixed-point inference method and a contrastive training scheme that reduces dataset bias.
The paper tackles class-agnostic counting by developing a unified framework that accepts multiple prompt types (e.g., box, point, text) to generate density maps, achieving superior performance on class-agnostic counting benchmarks and in cross-dataset adaptation tasks.
Existing class-agnostic counting models typically rely on a single type of prompt, e.g., box annotations. This paper aims to establish a comprehensive prompt-based counting framework capable of generating density maps for objects of interest indicated by various prompt types, such as box, point, and text. To achieve this goal, we first convert prompts from different modalities into prompt masks in a training-free manner. These masks are then integrated into a class-agnostic counting methodology to predict density maps. Furthermore, we introduce a fixed-point inference procedure, along with an associated loss function, that improves counting accuracy without introducing new parameters. The effectiveness of this method is substantiated both theoretically and experimentally. Additionally, we implement a contrastive training scheme to mitigate the dataset bias inherent in current class-agnostic counting datasets, and our ablation study confirms its effectiveness. Our model excels on prominent class-agnostic counting datasets and exhibits superior performance in cross-dataset adaptation tasks.
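The abstract does not specify how prompts are turned into masks or what form the fixed-point iteration takes, so the following is only a minimal sketch under stated assumptions, not the paper's method. It assumes box and point prompts are rasterized directly into binary masks (text prompts, which would need a vision-language model, are not covered), that inference repeatedly applies one weight-tied refinement step d_{k+1} = f(d_k) until the density map stops changing, and that the count is the sum of the density map. All names (`box_to_mask`, `point_to_mask`, `fixed_point_inference`, `toy_step`), the disk radius, and the contractive toy step are hypothetical illustrations.

```python
import numpy as np


def box_to_mask(box, height, width):
    """Hypothetical: rasterize an (x0, y0, x1, y1) box prompt into a binary mask, no training needed."""
    x0, y0, x1, y1 = box
    mask = np.zeros((height, width))  # float64 by default
    mask[y0:y1, x0:x1] = 1.0
    return mask


def point_to_mask(point, height, width, radius=2):
    """Hypothetical: rasterize an (x, y) point prompt as a small disk; the radius is an assumption."""
    ys, xs = np.mgrid[0:height, 0:width]
    px, py = point
    return ((xs - px) ** 2 + (ys - py) ** 2 <= radius ** 2).astype(np.float64)


def fixed_point_inference(step, d0, tol=1e-6, max_iter=100):
    """Iterate d_{k+1} = step(d_k) until successive maps differ by less than tol (max-abs norm).

    Reusing the same `step` at every iteration adds no parameters.
    Returns the final density map and the number of iterations taken.
    """
    d = d0
    for k in range(1, max_iter + 1):
        d_next = step(d)
        if np.max(np.abs(d_next - d)) < tol:
            return d_next, k
        d = d_next
    return d, max_iter


# Toy usage: pretend the "true" density spreads 3 objects' worth of mass over the box prompt.
H, W = 8, 8
prompt_mask = box_to_mask((2, 2, 6, 6), H, W)
target = 3.0 * prompt_mask / prompt_mask.sum()


def toy_step(d):
    # A contraction with factor 0.5, so iteration converges to the unique fixed point `target`.
    return 0.5 * d + 0.5 * target


density, iters = fixed_point_inference(toy_step, np.zeros((H, W)))
count = float(density.sum())  # count = integral (sum) of the density map
print(f"estimated count = {count:.4f} after {iters} iterations")
```

The toy step is deliberately a contraction: by the Banach fixed-point theorem, any contraction on a complete space has a unique fixed point that this iteration reaches from any starting map. A learned refinement step carries no such guarantee in general, which is why a real implementation would also cap the number of iterations, as `max_iter` does here.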