ASMR: Learning Attribute-Based Person Search with Adaptive Semantic Margin Regularizer
This work addresses the challenge of finding person images from text attributes. It is an incremental improvement in cross-modal retrieval for specific domains such as surveillance or identification.
The paper tackles the problem of attribute-based person search by reducing the modality gap between text attributes and images, achieving state-of-the-art results on public benchmarks with a simple embedding model.
Attribute-based person search is the task of finding person images that best match a set of text attributes given as a query. The main challenge of this task is the large modality gap between attributes and images. To reduce the gap, we present a new loss for learning cross-modal embeddings in the context of attribute-based person search. We regard a set of attributes as a category of people sharing the same traits. In a joint embedding space of the two modalities, our loss pulls images close to their person categories for modality alignment. More importantly, it pushes apart a pair of person categories by a margin determined adaptively by their semantic distance, where the distance metric is learned end-to-end so that the loss considers the importance of each attribute when relating person categories. Our loss, guided by the adaptive semantic margin, leads to more discriminative and semantically well-arranged distributions of person images. As a consequence, it enables a simple embedding model to achieve state-of-the-art results on public benchmarks without bells and whistles.
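The two roles of the loss described above (pulling an image toward its person category, and pushing category pairs apart by a semantics-dependent margin) can be illustrated with a minimal sketch. This is not the paper's exact formulation: the cosine distance, the hinge form, the weighted attribute mismatch used as the semantic distance, and all function and parameter names (`weighted_attribute_distance`, `alignment_term`, `margin_regularizer`, `scale`) are assumptions made for illustration. In the paper the attribute weights are learned end-to-end; here they are fixed numbers.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(p * q for p, q in zip(u, v))
    norm_u = math.sqrt(sum(p * p for p in u))
    norm_v = math.sqrt(sum(q * q for q in v))
    return 1.0 - dot / (norm_u * norm_v)


def weighted_attribute_distance(attrs_a, attrs_b, weights):
    """Assumed semantic distance: sum of per-attribute weights where the
    two binary attribute vectors disagree (weights stand in for the
    learned attribute importance)."""
    return sum(w for a, b, w in zip(attrs_a, attrs_b, weights) if a != b)


def alignment_term(image_emb, category_emb):
    """Pulls an image embedding toward the embedding of its person category."""
    return cosine_distance(image_emb, category_emb)


def margin_regularizer(category_embs, category_attrs, weights, scale=1.0):
    """Penalizes every pair of person categories whose embeddings are closer
    than a margin proportional to their semantic distance (hinge form is an
    assumption of this sketch)."""
    total = 0.0
    n = len(category_embs)
    for i in range(n):
        for j in range(i + 1, n):
            margin = scale * weighted_attribute_distance(
                category_attrs[i], category_attrs[j], weights
            )
            gap = cosine_distance(category_embs[i], category_embs[j])
            total += max(0.0, margin - gap)
    return total


# Toy example: three person categories over three binary attributes.
category_attrs = [[1, 0, 1], [0, 0, 1], [1, 1, 1]]
category_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
weights = [0.5, 0.25, 1.0]  # hypothetical attribute importances

reg = margin_regularizer(category_embs, category_attrs, weights)
align = alignment_term([1.0, 0.0], category_embs[0])
```

In this toy setup only categories 0 and 2 are penalized: they differ in one attribute yet share the same embedding, so the regularizer asks for them to be separated by at least that attribute's weight. Pairs that differ in more heavily weighted attributes receive a larger required margin, which is the sense in which the margin adapts to semantic distance.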