Crowd Counting with Deep Structured Scale Integration Network
This work addresses accurate crowd counting in unconstrained scenes, with applications such as surveillance and public safety. It represents a strong task-specific gain rather than a foundational advance.
The paper tackles scale variation in crowd counting by proposing DSSINet, which combines structured feature enhancement with a hierarchically structured loss function, and reports error reductions of 9.5% on the Shanghaitech dataset and 24.9% on the UCF-QNRF dataset relative to state-of-the-art methods.
Automatic estimation of the number of people in unconstrained crowded scenes is a challenging task, and one major difficulty stems from the huge scale variation of people. In this paper, we propose a novel Deep Structured Scale Integration Network (DSSINet) for crowd counting, which addresses the scale variation of people through structured feature representation learning and hierarchically structured loss function optimization. Unlike conventional methods, which directly fuse multiple features by weighted averaging or concatenation, we first introduce a Structured Feature Enhancement Module based on conditional random fields (CRFs) to mutually refine multiscale features with a message passing mechanism. In this module, each scale-specific feature is treated as a continuous random variable and passes complementary information to refine the features at other scales. Second, we use a Dilated Multiscale Structural Similarity loss to force DSSINet to learn the local correlation of people's scales within regions of various sizes, thus yielding high-quality density maps. Extensive experiments on four challenging benchmarks demonstrate the effectiveness of our method. Specifically, DSSINet reduces error by 9.5% on the Shanghaitech dataset and by 24.9% on the UCF-QNRF dataset relative to state-of-the-art methods.
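To make the idea of a dilated multiscale structural similarity loss more concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It computes SSIM between a predicted and a ground-truth density map using a small Gaussian window whose taps are spread out by several dilation rates, so that each rate compares local statistics over a region of a different size. Everything specific here is an assumption for illustration: the function names (`gaussian_taps`, `dilated_filter`, `dms_ssim_loss`), the 3x3 window, the dilation rates (1, 2, 3), the constants `c1` and `c2`, and the choice to average the per-rate SSIM values. The abstract does not specify how the scales are combined, and the paper's formulation may differ.

```python
import math

def gaussian_taps(size=3, sigma=1.0):
    """Normalized 2D Gaussian window as a list of (dy, dx, weight) taps."""
    r = size // 2
    taps = [(dy, dx, math.exp(-(dy * dy + dx * dx) / (2 * sigma * sigma)))
            for dy in range(-r, r + 1) for dx in range(-r, r + 1)]
    total = sum(w for _, _, w in taps)
    return [(dy, dx, w / total) for dy, dx, w in taps]

def dilated_filter(img, taps, dilation):
    """Weighted local average with taps spaced `dilation` pixels apart.

    Out-of-bounds samples are treated as zero (zero padding).
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy, dx, wt in taps:
                yy, xx = y + dy * dilation, x + dx * dilation
                if 0 <= yy < h and 0 <= xx < w:
                    acc += wt * img[yy][xx]
            out[y][x] = acc
    return out

def _product(a, b):
    """Element-wise product of two equally sized 2D lists."""
    return [[p * q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def dms_ssim_loss(pred, gt, dilations=(1, 2, 3), c1=1e-4, c2=9e-4):
    """1 minus the mean SSIM over all pixels and all dilation rates.

    Assumed simplification: each dilation rate is applied independently
    to the input maps, and the per-rate SSIM maps are averaged.
    """
    taps = gaussian_taps()
    total, count = 0.0, 0
    for d in dilations:
        mu_x = dilated_filter(pred, taps, d)
        mu_y = dilated_filter(gt, taps, d)
        e_xx = dilated_filter(_product(pred, pred), taps, d)
        e_yy = dilated_filter(_product(gt, gt), taps, d)
        e_xy = dilated_filter(_product(pred, gt), taps, d)
        for y in range(len(pred)):
            for x in range(len(pred[0])):
                mx, my = mu_x[y][x], mu_y[y][x]
                var_x = e_xx[y][x] - mx * mx   # local variance of pred
                var_y = e_yy[y][x] - my * my   # local variance of gt
                cov = e_xy[y][x] - mx * my     # local covariance
                ssim = ((2 * mx * my + c1) * (2 * cov + c2)) / (
                    (mx * mx + my * my + c1) * (var_x + var_y + c2))
                total += ssim
                count += 1
    return 1.0 - total / count

# Usage: a toy 6x6 ground-truth map with one head annotation,
# compared against an all-zero prediction.
gt_map = [[0.0] * 6 for _ in range(6)]
gt_map[2][3] = 1.0
empty_pred = [[0.0] * 6 for _ in range(6)]
loss_value = dms_ssim_loss(empty_pred, gt_map)
```

In this sketch the loss is zero when the prediction equals the ground truth and grows as their local statistics diverge. Larger dilation rates widen the region over which means, variances, and covariances are compared without adding taps to the window, which is one plausible reading of "regions of various sizes". A practical version would express the same filtering as dilated convolutions in a deep learning framework so that gradients flow back to the network.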