LLaVA-ReID: Selective Multi-image Questioner for Interactive Person Re-Identification
The work addresses a limitation of traditional text-based person ReID in real-world scenarios, where witness descriptions are often partial. The contribution is incremental in that it builds on existing ReID methods.
The paper tackles incomplete or vague witness descriptions in person re-identification by introducing an interactive, dialogue-based retrieval task (Inter-ReID). It proposes LLaVA-ReID, a model that generates targeted questions to refine those descriptions and that significantly outperforms baselines on both Inter-ReID and text-based ReID benchmarks.
Traditional text-based person ReID assumes that person descriptions from witnesses are complete and provided at once. However, in real-world scenarios, such descriptions are often partial or vague. To address this limitation, we introduce a new task called interactive person re-identification (Inter-ReID). Inter-ReID is a dialogue-based retrieval task that iteratively refines initial descriptions through ongoing interactions with the witnesses. To facilitate the study of this new task, we construct a dialogue dataset that incorporates multiple types of questions by decomposing fine-grained attributes of individuals. We further propose LLaVA-ReID, a question model that generates targeted questions based on visual and textual contexts to elicit additional details about the target person. Leveraging a looking-forward strategy, we prioritize the most informative questions as supervision during training. Experimental results on both Inter-ReID and text-based ReID benchmarks demonstrate that LLaVA-ReID significantly outperforms baselines.
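As a rough illustration of the looking-forward idea, the Python sketch below scores each candidate question by simulating the witness's answer, appending it to the current description, and checking how far the target person moves up the retrieval ranking. This is only one plausible reading of the abstract, which does not spell out the procedure. The function names, the word-overlap scorer, and the toy gallery are all hypothetical stand-ins, not the authors' implementation.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical scorer: counts description words that match a person's attributes.
# A real system would compare learned text and image embeddings instead.
def overlap_score(description: str, attributes: Set[str]) -> int:
    return len(set(description.split()) & attributes)

def rank_of_target(description: str, gallery: Dict[str, Set[str]], target_id: str) -> int:
    """Return the 1-based rank of the target when the gallery is sorted by score."""
    ranked = sorted(
        gallery,
        key=lambda pid: overlap_score(description, gallery[pid]),
        reverse=True,  # sorted() is stable, so ties keep gallery order
    )
    return ranked.index(target_id) + 1

def select_question(
    description: str,
    candidates: List[str],
    answer: Callable[[str], str],
    gallery: Dict[str, Set[str]],
    target_id: str,
) -> Tuple[str, int]:
    """Look one step ahead: pick the question whose answer ranks the target best."""
    best_question, best_rank = candidates[0], len(gallery) + 1
    for question in candidates:
        refined = description + " " + answer(question)
        rank = rank_of_target(refined, gallery, target_id)
        if rank < best_rank:
            best_question, best_rank = question, rank
    return best_question, best_rank

# Toy data (assumed for illustration only).
GALLERY = {
    "p1": {"red", "jacket", "backpack", "jeans"},
    "p2": {"red", "jacket", "handbag", "skirt"},
    "p3": {"red", "jacket", "backpack", "shorts"},
}
TARGET = "p3"
INITIAL = "red jacket"
ANSWERS = {
    "What is the person carrying?": "backpack",
    "What is the person wearing on the lower body?": "shorts",
}

chosen, rank_after = select_question(
    INITIAL, list(ANSWERS), ANSWERS.__getitem__, GALLERY, TARGET
)
```

In this toy setup the initial description matches all three people equally, so the target sits last. The question about the lower body is selected because its answer is the only one that separates the target from both distractors. During training, a question chosen this way could serve as the supervision target for the question model, which is how the abstract describes using the most informative questions.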