CV · May 15, 2020

ViTAA: Visual-Textual Attributes Alignment in Person Search by Natural Language

arXiv:2005.07327v2 · 232 citations
AI Analysis

This paper addresses the problem of retrieving a specific person from a pool of images using a textual description, for applications such as surveillance or social media. Its approach is incremental, building on attribute-based methods.

The paper aligns visual and textual attributes and reports state-of-the-art performance on two tasks: person search by natural language and person search by attribute-phrase queries.

Person search by natural language aims to retrieve a specific person from a large-scale image pool that matches a given textual description. While most current methods treat the task as holistic visual and textual feature matching, we approach it from an attribute-aligning perspective that allows grounding specific attribute phrases to the corresponding visual regions. We achieve this grounding, along with a performance boost, through robust feature learning in which the referred identity is accurately bundled by multiple attribute visual cues. Concretely, our Visual-Textual Attribute Alignment model (dubbed ViTAA) learns to disentangle the feature space of a person into subspaces corresponding to attributes, using a light auxiliary attribute segmentation branch. It then aligns these visual features with the textual attributes parsed from the sentences using a novel contrastive learning loss. We validate the ViTAA framework through extensive experiments on person search by natural language and by attribute-phrase queries, on which our system achieves state-of-the-art performance. Code will be publicly available upon publication.
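
To make the alignment step in the abstract more concrete, here is a minimal sketch of a per-attribute contrastive loss. It is not the loss defined in the paper, whose exact form the abstract does not give. It is a generic logistic margin loss on cosine similarity, and the function name `alignment_loss`, the attribute keys, and the parameters `alpha`, `beta`, and `tau` are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def alignment_loss(visual, textual, same_identity, alpha=0.6, beta=0.4, tau=10.0):
    """Hypothetical contrastive loss over attribute subspaces.

    visual, textual: dicts mapping an attribute name to a feature vector.
    same_identity:   True if the image and the sentence describe the same person.
    alpha, beta:     assumed similarity margins for positive and negative pairs.
    tau:             assumed temperature that sharpens the penalty.
    """
    # Only attributes mentioned in the sentence and present in the image are compared.
    shared = [name for name in visual if name in textual]
    if not shared:
        return 0.0
    total = 0.0
    for name in shared:
        sim = cosine(visual[name], textual[name])
        if same_identity:
            # Penalise matched pairs whose similarity falls below alpha.
            total += softplus(-tau * (sim - alpha))
        else:
            # Penalise mismatched pairs whose similarity rises above beta.
            total += softplus(tau * (sim - beta))
    return total / len(shared)


# Toy features: the text vectors point in nearly the same direction as the image vectors.
image_feats = {"upper_body": [1.0, 0.0], "shoes": [0.0, 1.0]}
text_feats = {"upper_body": [0.9, 0.1], "shoes": [0.1, 0.9]}

loss_if_match = alignment_loss(image_feats, text_feats, same_identity=True)
loss_if_mismatch = alignment_loss(image_feats, text_feats, same_identity=False)
print(loss_if_match < loss_if_mismatch)
```

Because the toy features are nearly parallel, the loss is small when the pair is labelled as the same identity and large when it is labelled as different, which is the behaviour a contrastive objective of this kind is meant to produce.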

Code Implementations · 2 repos
