Boosting Weak Positives for Text Based Person Search
This work addresses a specific bottleneck in text-based person search for security and surveillance applications, representing an incremental improvement over existing alignment methods.
The paper addresses a weakness in text-based person search: models prioritize easy image-text pairs, and some recent approaches discard challenging samples as noise. It introduces a boosting technique that dynamically identifies and emphasizes these challenging samples during training. The method achieves improved performance across four pedestrian datasets.
Large vision-language models have revolutionized cross-modal object retrieval, but text-based person search (TBPS) remains a challenging task due to limited data and the fine-grained nature of the task. Existing methods primarily focus on aligning image-text pairs in a common representation space, often disregarding the fact that real-world positive image-text pairs share varying degrees of similarity. This leads models to prioritize easy pairs, and in some recent approaches, challenging samples are discarded as noise during training. In this work, we introduce a boosting technique that dynamically identifies and emphasizes these challenging samples during training. Our approach is motivated by classical boosting techniques and dynamically updates the weights of the weak positives, that is, pairs for which the rank-1 match does not share the identity of the query. The increased weight lets these misranked pairs contribute more to the loss, so the network must pay more attention to such samples. Our method achieves improved performance across four pedestrian datasets, demonstrating the effectiveness of the proposed module.
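The reweighting idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not give the update rule, so the exponential, AdaBoost-style scaling, the `alpha` parameter, the renormalisation to a mean weight of 1, and all function names below are assumptions made for illustration only.

```python
import math


def find_weak_positives(sim, text_ids, image_ids):
    """Flag each text query whose rank-1 image has a different identity."""
    flags = []
    for q, row in enumerate(sim):
        top = max(range(len(row)), key=row.__getitem__)  # index of rank-1 image
        flags.append(image_ids[top] != text_ids[q])
    return flags


def update_weights(weights, weak_flags, alpha=0.5):
    """Hypothetical boosting-style update (an assumption, not from the paper):
    scale the weight of each misranked pair by exp(alpha), then renormalise
    so that the mean weight stays at 1."""
    raised = [w * math.exp(alpha) if weak else w
              for w, weak in zip(weights, weak_flags)]
    mean = sum(raised) / len(raised)
    return [w / mean for w in raised]


def weighted_loss(per_pair_losses, weights):
    """Average of per-pair losses, each scaled by its current weight."""
    return sum(w * l for w, l in zip(weights, per_pair_losses)) / len(weights)


# Toy batch: 3 text queries against 3 gallery images (rows are queries).
sim = [
    [0.9, 0.2, 0.1],
    [0.3, 0.4, 0.8],  # query 1 ranks image 2 first, which has another identity
    [0.1, 0.2, 0.7],
]
text_ids = [10, 11, 12]
image_ids = [10, 11, 12]

weak = find_weak_positives(sim, text_ids, image_ids)
weights = update_weights([1.0, 1.0, 1.0], weak)
loss = weighted_loss([0.2, 1.1, 0.3], weights)  # illustrative per-pair losses
```

In this toy batch only the second query is misranked, so its pair receives a larger weight than the other two and contributes more to the weighted loss. In a real training loop the similarity matrix would come from the model's image and text embeddings, and the weights would be refreshed as rankings change.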