CONQUER: Context-Aware Representation with Query Enhancement for Text-Based Person Search
This work addresses a practical public-safety problem by improving retrieval accuracy for text-based person search, though the contribution appears incremental: it builds on existing methods with targeted refinements.
The paper tackles Text-Based Person Search, the task of retrieving pedestrian images from natural language descriptions, which is hindered by cross-modal discrepancies and ambiguous queries. It introduces CONQUER, a two-stage framework that improves Rank-1 accuracy and mAP on CUHK-PEDES, ICFG-PEDES, and RSTPReid, with notable gains in cross-domain and incomplete-query scenarios.
Text-Based Person Search (TBPS) aims to retrieve pedestrian images from large galleries using natural language descriptions. This task, essential for public safety applications, is hindered by cross-modal discrepancies and ambiguous user queries. We introduce CONQUER, a two-stage framework designed to address these challenges by enhancing cross-modal alignment during training and adaptively refining queries at inference. During training, CONQUER employs multi-granularity encoding, complementary pair mining, and context-guided optimal matching based on Optimal Transport to learn robust embeddings. At inference, a plug-and-play query enhancement module refines vague or incomplete queries via anchor selection and attribute-driven enrichment, without requiring retraining of the backbone. Extensive experiments on CUHK-PEDES, ICFG-PEDES, and RSTPReid demonstrate that CONQUER consistently outperforms strong baselines in both Rank-1 accuracy and mAP, yielding notable improvements in cross-domain and incomplete-query scenarios. These results highlight CONQUER as a practical and effective solution for real-world TBPS deployment. Source code is available at https://github.com/zqxie77/CONQUER.
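To make the idea of Optimal Transport based matching concrete, the following is a minimal sketch of how a set of text-token embeddings could be softly matched to a set of image-patch embeddings. It is not taken from the CONQUER code base. The function names (`cosine_cost`, `sinkhorn`, `ot_similarity`), the cosine cost, the uniform marginals, the entropic Sinkhorn solver, and the values of `eps` and `n_iters` are all assumptions chosen for illustration; the paper's context-guided formulation may differ in each of these respects.

```python
import math


def cosine_cost(text_feats, image_feats):
    """Cost matrix: 1 - cosine similarity for every (token, patch) pair."""
    def unit(vec):
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # guard zero vectors
        return [x / norm for x in vec]

    tokens = [unit(v) for v in text_feats]
    patches = [unit(v) for v in image_feats]
    return [[1.0 - sum(a * b for a, b in zip(t, p)) for p in patches]
            for t in tokens]


def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropy-regularised transport plan with uniform marginals.

    Assumption: uniform mass on tokens and patches; a context-guided
    variant would presumably replace these with learned weights.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n            # mass assigned to each text token
    b = [1.0 / m] * m            # mass assigned to each image patch
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def ot_similarity(text_feats, image_feats, eps=0.1):
    """Image-text score: cosine similarities weighted by the transport plan."""
    cost = cosine_cost(text_feats, image_feats)
    plan = sinkhorn(cost, eps)
    return sum(plan[i][j] * (1.0 - cost[i][j])
               for i in range(len(cost)) for j in range(len(cost[0])))


if __name__ == "__main__":
    text = [[1.0, 0.0], [0.0, 1.0]]          # two toy token embeddings
    matching = [[1.0, 0.0], [0.0, 1.0]]      # patches aligned with the tokens
    mismatched = [[-1.0, 0.0], [0.0, -1.0]]  # patches pointing the other way
    print(ot_similarity(text, matching) > ot_similarity(text, mismatched))
```

In this toy setting the transport plan concentrates mass on the token-patch pairs with the lowest cost, so an image whose patches align with the query tokens scores higher than one whose patches do not. A smaller `eps` yields a sharper, closer to one-to-one matching, at the price of slower convergence.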