CV · Sep 22, 2018

Pose-Guided Multi-Granularity Attention Network for Text-Based Person Search

Ya Jing, Chenyang Si, Junbo Wang, Wei Wang, Liang Wang, Tieniu Tan

arXiv:1809.08440v3 16.9183 citations

Originality: Incremental advance

AI Analysis

This work addresses cross-modal matching for video surveillance applications, representing an incremental advance by incorporating pose information and multi-granularity alignment.

The paper tackles the problem of text-based person search by proposing a pose-guided multi-granularity attention network to exploit multi-level semantic relevance between images and descriptions, achieving a 15% improvement in top-1 accuracy on the CUHK-PEDES dataset.

Text-based person search aims to retrieve the matching person images from an image database given a sentence describing the person, which has great potential for applications such as video surveillance. Extracting the visual content that corresponds to the description is the key to this cross-modal matching problem. Moreover, correlated images and descriptions involve different granularities of semantic relevance, which previous methods usually ignore. To exploit the multi-level corresponding visual content, we propose a pose-guided multi-granularity attention network (PMA). First, we propose a coarse alignment network (CA) that selects the image regions related to the global description through similarity-based attention. To further capture the phrase-related visual body parts, we propose a fine-grained alignment network (FA), which employs pose information to learn the latent semantic alignment between visual body parts and textual noun phrases. To verify the effectiveness of our model, we perform extensive experiments on the CUHK Person Description Dataset (CUHK-PEDES), which is currently the only available dataset for text-based person search. Experimental results show that our approach outperforms the state-of-the-art methods by 15% in terms of the top-1 metric.
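The abstract does not give the formulas behind the coarse alignment step, so the following is only a minimal sketch of what a generic similarity-based attention over image regions can look like. It is not the authors' implementation. The function names (`cosine`, `softmax`, `attend_regions`), the use of cosine similarity, the `temperature` parameter, and the toy 2-D feature vectors are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def softmax(scores, temperature=1.0):
    """Numerically stable softmax; a lower temperature gives sharper weights."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend_regions(region_feats, text_feat, temperature=0.1):
    """Weight image regions by their similarity to a global text embedding.

    Assumed stand-in for a coarse alignment step: returns the attention
    weights and the attention-weighted sum of the region features.
    """
    sims = [cosine(region, text_feat) for region in region_feats]
    weights = softmax(sims, temperature)
    dim = len(text_feat)
    attended = [
        sum(w * region[d] for w, region in zip(weights, region_feats))
        for d in range(dim)
    ]
    return weights, attended

# Toy example: three hypothetical region features and one sentence embedding.
regions = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
sentence = [1.0, 0.0]
weights, attended = attend_regions(regions, sentence)
```

In this toy setup, the region whose feature points in the same direction as the sentence embedding receives the largest weight and the orthogonal region the smallest. A fine-grained variant would, under similar assumptions, repeat the same weighting with noun-phrase embeddings and pose-derived body-part features in place of the global ones.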
