CV · May 11

NEXT: Multi-Grained Mixture of Experts via Text-Modulation for Multi-Modal Object Re-Identification

Shihao Li, Huaibo Huang, Junxian Duan, Aihua Zheng, Jin Tang, Jixin Ma

arXiv:2505.20001 · 15.92 citations · h-index: 14

Predicted impact: top 54% in CV · last 90 days
Originality: Incremental advance

AI Analysis

For multi-modal object ReID researchers, this work improves fine-grained recognition across modalities by leveraging high-quality captions and a multi-expert architecture.

This paper addresses multi-modal object ReID by proposing a caption generation pipeline that reduces the unknown recognition rate of MLLMs, together with a framework (NEXT) built on text-modulated semantic experts and structural experts. It reports state-of-the-art results on two person datasets and three vehicle datasets.

Multi-modal object Re-IDentification (ReID) aims to obtain complete identity features across heterogeneous modalities. However, most existing methods rely on implicit feature fusion modules, making it difficult to model fine-grained recognition patterns under the various challenges of the real world. Benefiting from powerful Multi-modal Large Language Models (MLLMs), object appearances are effectively translated into descriptive captions. In this paper, we propose a reliable caption generation pipeline based on attribute confidence, which significantly reduces the unknown recognition rate of MLLMs and improves the quality of the generated text. Additionally, to model diverse identity patterns, we propose a novel ReID framework named NEXT, the Multi-grained Mixture of Experts via Text-Modulation for Multi-modal Object Re-Identification. Specifically, we decouple the recognition problem into semantic and structural branches to separately capture fine-grained appearance features and coarse-grained structure features. For semantic recognition, we first propose Text-Modulated Semantic Experts (TMSE), which randomly sample high-quality captions to modulate experts that capture semantic features and mine inter-modality complementary cues. Second, to recognize structure features, we propose Context-Shared Structure Experts (CSSE), which focus on the holistic object structure and maintain identity structural consistency via a soft routing mechanism. Finally, we propose Multi-Grained Features Aggregation (MGFA), which adopts a unified fusion strategy to effectively integrate multi-grained expert features into the final identity representations. Extensive experiments on two public person datasets and three vehicle datasets demonstrate the effectiveness of our method, showing that it significantly outperforms existing state-of-the-art methods.
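
The abstract mentions a soft routing mechanism but gives no implementation details. As a rough illustration only, the following sketch shows generic soft routing in a mixture of experts: every expert processes the input, and the outputs are blended with softmax gate weights rather than selecting a single expert. It is not the authors' CSSE module. The function names, the toy experts, and the gate weights are all hypothetical and chosen purely for demonstration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def soft_route(feature, gate_weights, experts):
    """Blend the outputs of ALL experts using softmax gate weights.

    Assumption: a linear gate (one weight vector per expert). This is a
    generic soft-routing sketch, not the routing used in the paper.
    """
    gate = softmax([dot(w, feature) for w in gate_weights])
    outputs = [expert(feature) for expert in experts]
    dim = len(outputs[0])
    mixed = [sum(g * out[i] for g, out in zip(gate, outputs)) for i in range(dim)]
    return mixed, gate


# Hypothetical toy experts standing in for learned sub-networks.
experts = [
    lambda x: [2.0 * v for v in x],   # expert 0: scales the feature
    lambda x: [v + 1.0 for v in x],   # expert 1: shifts the feature
]
gate_weights = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical gate parameters
feature = [0.5, 0.5]

mixed, gate = soft_route(feature, gate_weights, experts)
print(gate, mixed)
```

Because the gate weights sum to one, the blended feature is a convex combination of the expert outputs. In this toy input both gate scores are equal, so each expert contributes half of the result.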
