CV | Dec 11, 2024

Embedding and Enriching Explicit Semantics for Visible-Infrared Person Re-Identification

arXiv:2412.08406v11 citations | h-index: 9
Originality: Incremental advance
AI Analysis

This paper addresses the retrieval of pedestrian images across different modalities for surveillance applications. It is an incremental improvement that incorporates explicit semantics from large language-vision models.

The paper tackles visible-infrared person re-identification by proposing the EEES framework, which learns semantically rich cross-modality pedestrian representations. The authors report state-of-the-art performance, with experiments demonstrating the framework's effectiveness.

Visible-infrared person re-identification (VIReID) retrieves pedestrian images with the same identity across different modalities. Existing methods learn visual content solely from images, lacking the capability to sense high-level semantics. In this paper, we propose an Embedding and Enriching Explicit Semantics (EEES) framework to learn semantically rich cross-modality pedestrian representations. Our method offers several contributions. First, with the collaboration of multiple large language-vision models, we develop Explicit Semantics Embedding (ESE), which automatically supplements language descriptions for pedestrians and aligns image-text pairs into a common space, thereby learning visual content associated with explicit semantics. Second, recognizing the complementarity of multi-view information, we present Cross-View Semantics Compensation (CVSC), which constructs multi-view image-text pair representations, establishes their many-to-many matching, and propagates knowledge to single-view representations, thus compensating visual content with its missing cross-view semantics. Third, to eliminate noisy semantics such as conflicting color attributes in different modalities, we design Cross-Modality Semantics Purification (CMSP), which constrains the distance between inter-modality image-text pair representations to be close to that between intra-modality image-text pair representations, further enhancing the modality-invariance of visual content. Finally, experimental results demonstrate the effectiveness and superiority of the proposed EEES.
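The abstract names two mechanisms without giving formulas: aligning image-text pairs in a common space (ESE) and keeping inter-modality image-text distances close to intra-modality ones (CMSP). The sketch below is only an illustration of those two ideas under stated assumptions, not the authors' implementation. It assumes a symmetric InfoNCE-style contrastive loss for alignment, cosine distance as the metric, and a squared-gap penalty for purification. The function names `alignment_loss` and `purification_penalty`, the `temperature` parameter, and the toy 2-D embeddings are all hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_distance(a, b):
    """1 - cosine similarity (assumed metric; the abstract does not name one)."""
    a, b = l2_normalize(a), l2_normalize(b)
    return 1.0 - sum(x * y for x, y in zip(a, b))


def _cross_entropy_diagonal(logits):
    """Mean cross-entropy where row k's correct class is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[k]
    return total / len(logits)


def alignment_loss(image_embs, text_embs, temperature=0.07):
    """Assumed symmetric contrastive loss: image i is paired with text i."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    logits = [
        [sum(x * y for x, y in zip(i, t)) / temperature for t in txts]
        for i in imgs
    ]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_cross_entropy_diagonal(logits)
                  + _cross_entropy_diagonal(transposed))


def purification_penalty(vis_img, vis_txt, ir_img, ir_txt):
    """Assumed penalty: squared gap between the mean inter-modality and
    mean intra-modality image-text distances for one identity."""
    intra = 0.5 * (cosine_distance(vis_img, vis_txt)
                   + cosine_distance(ir_img, ir_txt))
    inter = 0.5 * (cosine_distance(vis_img, ir_txt)
                   + cosine_distance(ir_img, vis_txt))
    return (inter - intra) ** 2


if __name__ == "__main__":
    # Toy 2-D embeddings, purely illustrative.
    images = [[1.0, 0.0], [0.0, 1.0]]
    texts = [[1.0, 0.1], [0.1, 1.0]]
    print("alignment loss:", alignment_loss(images, texts))
    print("purification penalty:",
          purification_penalty([1.0, 0.0], [1.0, 0.1], [0.9, 0.2], [1.0, 0.3]))
```

In this sketch the penalty is zero when image-text distances are the same within and across modalities, which is one simple way to express the constraint the abstract describes. A training setup would add such a term to the alignment and identity losses with a weighting factor, which is likewise not specified in the source.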

