CVOct 16, 2023

Generalizable Person Search on Open-world User-Generated Video Content

arXiv:2310.10068v13 citationsh-index: 25
Originality Incremental advance
AI Analysis

This addresses the challenge of generalizing person search to open-world scenarios for developers, though it is incremental in improving domain adaptation.

The paper tackles the problem of person search models struggling with domain differences between training on user-generated videos and deploying in surveillance scenes, proposing a framework that achieves promising performance on benchmarks without target domain annotations.

Person search is a challenging task that involves detecting and retrieving individuals from a large set of un-cropped scene images. Existing person search applications are mostly trained and deployed in the same-origin scenarios. However, collecting and annotating training samples for each scene is often difficult due to the limitation of resources and the labor cost. Moreover, large-scale intra-domain data for training are generally not legally available for common developers, due to the regulation of privacy and public security. Leveraging easily accessible large-scale User Generated Video Contents (\emph{i.e.} UGC videos) to train person search models can fit the open-world distribution, but still suffering a performance gap from the domain difference to surveillance scenes. In this work, we explore enhancing the out-of-domain generalization capabilities of person search models, and propose a generalizable framework on both feature-level and data-level generalization to facilitate downstream tasks in arbitrary scenarios. Specifically, we focus on learning domain-invariant representations for both detection and ReID by introducing a multi-task prototype-based domain-specific batch normalization, and a channel-wise ID-relevant feature decorrelation strategy. We also identify and address typical sources of noise in open-world training frames, including inaccurate bounding boxes, the omission of identity labels, and the absence of cross-camera data. Our framework achieves promising performance on two challenging person search benchmarks without using any human annotation or samples from the target domain.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes