CV · Jul 30, 2020

Pixel-wise Crowd Understanding via Synthetic Data

arXiv:2007.16032v2 · 130 citations
AI Analysis

This work addresses data scarcity in crowd analysis for video surveillance and offers a practical solution through synthetic data generation. Its contribution is incremental, since it builds on existing domain adaptation techniques.

The paper tackles pixel-wise crowd understanding, which requires large labeled datasets that are expensive to obtain. It generates a synthetic crowd dataset using Grand Theft Auto V and proposes two ways to exploit it, pre-training and domain adaptation, both aimed at improving model accuracy on real-world crowd scenes.

Crowd analysis via computer vision techniques is an important topic in video surveillance, with widespread applications including crowd monitoring, public safety, and space design. Pixel-wise crowd understanding is the most fundamental task in crowd analysis because it yields finer results for video sequences and still images than other analysis tasks. Unfortunately, pixel-level understanding needs a large amount of labeled training data. Annotating such data is expensive, so current crowd datasets are small, and as a result most algorithms suffer from over-fitting to varying degrees. In this paper, taking crowd counting and segmentation as examples of pixel-wise crowd understanding, we attempt to remedy these problems from two aspects: data and methodology. First, we develop a free data collector and labeler to generate synthetic, labeled crowd scenes in a computer game, Grand Theft Auto V, and use it to construct a large-scale, diverse synthetic crowd dataset named the "GCC Dataset". Second, we propose two simple methods that improve crowd understanding by exploiting the synthetic data: 1) supervised crowd understanding: pre-train a crowd analysis model on the synthetic data, then fine-tune it using real data and labels, which makes the model perform better in the real world; 2) crowd understanding via domain adaptation: translate the synthetic data to photo-realistic images, then train the model on the translated data and labels. As a result, the trained model works well in real crowd scenes.
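
The first strategy in the abstract (pre-train on plentiful synthetic data, then fine-tune on scarce real data) can be illustrated with a toy sketch. This is not the paper's crowd model: a two-parameter linear regressor stands in for the network, and the data generators, slopes, noise levels, learning rates, and the names `make_data`, `train`, and `mse` are all assumptions made for illustration.

```python
import random


def make_data(rng, n, slope, intercept, noise=0.1):
    """Hypothetical generator: n noisy samples of y = slope * x + intercept."""
    data = []
    for _ in range(n):
        x = rng.uniform(0.0, 1.0)
        data.append((x, slope * x + intercept + rng.gauss(0.0, noise)))
    return data


def mse(params, data):
    """Mean squared error of the linear model (w, b) on the data."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train(params, data, lr, epochs):
    """Full-batch gradient descent on the mean squared error."""
    w, b = params
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)


rng = random.Random(0)

# Assumed stand-ins: a large "synthetic" set and a small "real" set whose
# underlying relationship is slightly shifted (a toy domain gap).
synthetic = make_data(rng, 1000, slope=2.0, intercept=1.0)
real = make_data(rng, 20, slope=2.2, intercept=0.8)

# Step 1: pre-train from scratch on the synthetic data.
pretrained = train((0.0, 0.0), synthetic, lr=0.1, epochs=500)

# Step 2: fine-tune the pre-trained parameters on the small real set.
finetuned = train(pretrained, real, lr=0.05, epochs=100)

print("real-data MSE after pre-training only:", mse(pretrained, real))
print("real-data MSE after fine-tuning:      ", mse(finetuned, real))
```

The only point of the sketch is the two-stage schedule: the fine-tuning stage starts from the pre-trained parameters instead of a fresh initialization. The second strategy, translating synthetic images toward photo-realistic ones before training, needs an image translation model and is not sketched here.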
