CV · Apr 9, 2023

CrowdCLIP: Unsupervised Crowd Counting via Vision-Language Model

arXiv:2304.04231v1 · 95 citations · h-index: 27 · Has Code
Originality: Highly original
AI Analysis

CrowdCLIP addresses the cost of manual labeling in crowd counting, a bottleneck for applications like surveillance and event management, by offering an unsupervised approach.

The paper proposes CrowdCLIP, an unsupervised crowd counting framework that uses a vision-language model to map crowd patches to count text. It outperforms previous unsupervised methods and, in cross-dataset settings, even surpasses some supervised ones.

Supervised crowd counting relies heavily on costly manual labeling, which is difficult and expensive, especially in dense scenes. To alleviate the problem, we propose a novel unsupervised framework for crowd counting, named CrowdCLIP. The core idea is built on two observations: 1) the recent contrastive pre-trained vision-language model (CLIP) has presented impressive performance on various downstream tasks; 2) there is a natural mapping between crowd patches and count text. To the best of our knowledge, CrowdCLIP is the first to investigate the vision language knowledge to solve the counting problem. Specifically, in the training stage, we exploit the multi-modal ranking loss by constructing ranking text prompts to match the size-sorted crowd patches to guide the image encoder learning. In the testing stage, to deal with the diversity of image patches, we propose a simple yet effective progressive filtering strategy to first select the highly potential crowd patches and then map them into the language space with various counting intervals. Extensive experiments on five challenging datasets demonstrate that the proposed CrowdCLIP achieves superior performance compared to previous unsupervised state-of-the-art counting methods. Notably, CrowdCLIP even surpasses some popular fully-supervised methods under the cross-dataset setting. The source code will be available at https://github.com/dk-liang/CrowdCLIP.
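The two ideas in the abstract, ordering size-sorted patches and mapping a patch to count text, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `predict_count` and `pairwise_ranking_loss`, the toy 2-D embeddings, and the count values are all assumptions made for illustration, and the hinge penalty is a generic pairwise ranking loss standing in for the paper's multi-modal ranking loss. In practice the embeddings would come from CLIP's image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict_count(patch_emb, prompt_embs, counts):
    """Return the count whose text-prompt embedding is most similar to the patch.

    prompt_embs[i] is a (hypothetical) embedding of a count prompt for counts[i].
    """
    sims = [cosine(patch_emb, p) for p in prompt_embs]
    best = max(range(len(sims)), key=sims.__getitem__)
    return counts[best]


def pairwise_ranking_loss(scores, margin=0.0):
    """Generic hinge penalty for out-of-order scores.

    scores are listed from the smallest patch to the largest, so a larger
    patch is expected to score at least as high as any smaller one.
    """
    loss = 0.0
    for i in range(len(scores)):
        for j in range(i + 1, len(scores)):
            loss += max(0.0, scores[i] - scores[j] + margin)
    return loss


# Toy data (assumed values, not taken from the paper).
counts = [0, 20, 40]
prompt_embs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
patch = [0.6, 0.8]

predicted = predict_count(patch, prompt_embs, counts)
ordered_loss = pairwise_ranking_loss([0.1, 0.5, 0.9])
violated_loss = pairwise_ranking_loss([0.9, 0.5])
```

Here the patch is closest to the second prompt, so `predicted` is 20. The correctly ordered scores incur no penalty, while the swapped pair is penalized by the size of the violation.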

Code Implementations: 2 repos
