CV · Mar 14, 2024

CLIP-EBC: CLIP Can Count Accurately through Enhanced Blockwise Classification

arXiv:2403.09281v3 · 12.820 citations · Has Code · ICME

Originality: Incremental advance

AI Analysis

This work addresses accurate crowd counting from images, which matters for applications such as public safety and urban planning. It introduces a framework that improves on existing classification-based methods. The advance is incremental in that it applies CLIP to a specific domain task.

The paper tackles crowd density estimation by proposing CLIP-EBC, a fully CLIP-based model built on an Enhanced Blockwise Classification (EBC) framework. It achieves state-of-the-art performance on the NWPU-Crowd test set, with an MAE of 58.2 and an RMSE of 268.5, improvements of 8.6% and 13.3% over the previous best method, STEERER.
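The reported figures rest on two standard error metrics, MAE and RMSE, computed over per-image counts. The sketch below shows how they are computed and how a percentage improvement relates to them. It assumes the quoted improvements are relative reductions in error; the helper names are illustrative and not taken from the paper's code, and the baseline printed at the end is back-calculated from the quoted numbers, not a value stated in the source.

```python
import math

def mae(pred, true):
    """Mean absolute error between predicted and true counts."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def rmse(pred, true):
    """Root mean squared error; penalises large misses more than MAE."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))

def relative_improvement(new_error, old_error):
    """Percentage reduction in error (assumed meaning of 'improvement')."""
    return (old_error - new_error) / old_error * 100.0

def implied_baseline(new_error, improvement_pct):
    """Baseline error implied by a new error and a relative improvement."""
    return new_error / (1.0 - improvement_pct / 100.0)

if __name__ == "__main__":
    # Back-calculated from the quoted MAE and RMSE, under the assumption above.
    print(round(implied_baseline(58.2, 8.6), 1))
    print(round(implied_baseline(268.5, 13.3), 1))
```

Because RMSE squares each error before averaging, a few badly miscounted dense scenes can dominate it, which is one reason the two metrics are usually reported together.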

We propose CLIP-EBC, the first fully CLIP-based model for accurate crowd density estimation. While the CLIP model has demonstrated remarkable success in addressing recognition tasks such as zero-shot image classification, its potential for counting has been largely unexplored due to the inherent challenges in transforming a regression problem, such as counting, into a recognition task. In this work, we investigate and enhance CLIP's ability to count, focusing specifically on the task of estimating crowd sizes from images. Existing classification-based crowd-counting frameworks have significant limitations, including the quantization of count values into bordering real-valued bins and the sole focus on classification errors. These practices result in label ambiguity near the shared borders and inaccurate prediction of count values. Hence, directly applying CLIP within these frameworks may yield suboptimal performance. To address these challenges, we first propose the Enhanced Blockwise Classification (EBC) framework. Unlike previous methods, EBC utilizes integer-valued bins, effectively reducing ambiguity near bin boundaries. Additionally, it incorporates a regression loss based on density maps to improve the prediction of count values. Within our backbone-agnostic EBC framework, we then introduce CLIP-EBC to fully leverage CLIP's recognition capabilities for this task. Extensive experiments demonstrate the effectiveness of EBC and the competitive performance of CLIP-EBC. Specifically, our EBC framework can improve existing classification-based methods by up to 44.5% on the UCF-QNRF dataset, and CLIP-EBC achieves state-of-the-art performance on the NWPU-Crowd test set, with an MAE of 58.2 and an RMSE of 268.5, representing improvements of 8.6% and 13.3% over the previous best method, STEERER. The code and weights are available at https://github.com/Yiming-M/CLIP-EBC.
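The two ideas behind EBC described in the abstract, integer-valued bins and an added regression term, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the bin edges in `BINS` are hypothetical, the predicted count is taken as a probability-weighted average of bin centres, and a plain absolute error stands in for the density-map-based regression loss, whose exact form the abstract does not give.

```python
import math

# Hypothetical integer-valued bins: each is an inclusive (low, high) range of
# whole-number counts, so no count sits on a border shared by two bins.
BINS = [(0, 0), (1, 1), (2, 2), (3, 4), (5, 8)]

def bin_index(count):
    """Return the index of the bin containing an integer block count."""
    for i, (low, high) in enumerate(BINS):
        if low <= count <= high:
            return i
    return len(BINS) - 1  # counts above the last bin are clamped (assumption)

def bin_centers():
    """Representative count value for each bin (midpoint, an assumption)."""
    return [(low + high) / 2 for low, high in BINS]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_count(probs):
    """Predicted block count as the probability-weighted bin centre."""
    return sum(p * c for p, c in zip(probs, bin_centers()))

def block_loss(logits, true_count, reg_weight=1.0):
    """Classification loss plus a count-regression term for one block.

    The absolute-error term is a stand-in for the paper's
    density-map-based regression loss.
    """
    probs = softmax(logits)
    ce = -math.log(probs[bin_index(true_count)])
    reg = abs(expected_count(probs) - true_count)
    return ce + reg_weight * reg
```

With real-valued bordering bins such as [0, 1) and [1, 2), a block whose annotated count lies near 1 could plausibly be assigned to either bin. With the integer ranges above, every whole-number count maps to exactly one bin, and the regression term still penalises a prediction that lands in the right bin but yields a poor count.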
