CV · Apr 15, 2025

DMPT: Decoupled Modality-aware Prompt Tuning for Multi-modal Object Re-identification

arXiv:2504.10985v1 · 2 citations · h-index: 3 · WACV
Originality: Incremental advance
AI Analysis

This work addresses the high computational and storage costs that full fine-tuning imposes on computer vision researchers and practitioners. The advance is incremental, since it builds on existing prompt-tuning ideas.

The paper tackles the computational cost of fully fine-tuning large pre-trained models for multi-modal object re-identification. It proposes DMPT, a prompt-tuning framework that freezes the backbone and trains only newly added prompt parameters, reported as 6.5% of the backbone parameters, while achieving results competitive with state-of-the-art methods on common benchmarks.

Current multi-modal object re-identification approaches based on large-scale pre-trained backbones (e.g., ViT) have made remarkable progress and achieved excellent performance. However, these methods usually adopt the standard full fine-tuning paradigm, which requires optimizing a large number of backbone parameters and therefore incurs extensive computational and storage costs. In this work, we propose an efficient prompt-tuning framework tailored for multi-modal object re-identification, dubbed DMPT, which freezes the main backbone and optimizes only a small set of newly added, decoupled modality-aware parameters. Specifically, we explicitly decouple the visual prompts into modality-specific prompts, which leverage prior modality knowledge from a powerful text encoder, and modality-independent semantic prompts, which extract semantic information from multi-modal inputs such as visible, near-infrared, and thermal-infrared images. Building on the extracted features, we further design a Prompt Inverse Bind (PromptIBind) strategy that uses bind prompts as a medium to connect the semantic prompt tokens of different modalities and to facilitate the exchange of complementary multi-modal information, boosting the final re-identification results. Experimental results on multiple common benchmarks demonstrate that DMPT achieves results competitive with existing state-of-the-art methods while fine-tuning only 6.5% as many parameters as the backbone contains.
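
To give a sense of scale for the 6.5% figure, the sketch below estimates what such a budget would amount to for a frozen backbone. It is a back-of-the-envelope calculation, not the authors' accounting: the ViT-B/16-like sizes (12 blocks, width 768), the "deep prompt" layout, and the helper names `transformer_block_params`, `encoder_params`, `deep_prompt_params`, and `trainable_ratio` are all assumptions introduced here for illustration.

```python
# Back-of-the-envelope sketch. All sizes are assumed ViT-B/16-like values
# chosen for illustration; none of them come from the DMPT paper.

def transformer_block_params(dim: int, mlp_ratio: int = 4) -> int:
    """Weights and biases of one standard transformer block."""
    attention = 4 * dim * dim + 4 * dim            # Q, K, V, output projections
    hidden = mlp_ratio * dim
    mlp = 2 * dim * hidden + hidden + dim          # two linear layers with biases
    layer_norms = 2 * 2 * dim                      # two LayerNorms (scale and shift)
    return attention + mlp + layer_norms


def encoder_params(depth: int, dim: int) -> int:
    """Parameters in a stack of `depth` identical blocks (embeddings ignored)."""
    return depth * transformer_block_params(dim)


def deep_prompt_params(tokens_per_layer: int, dim: int, depth: int) -> int:
    """Learnable prompt tokens added at every layer (hypothetical layout)."""
    return tokens_per_layer * dim * depth


def trainable_ratio(trainable: int, frozen: int) -> float:
    """Trainable parameters as a fraction of the frozen backbone size."""
    return trainable / frozen


backbone = encoder_params(depth=12, dim=768)       # 85,054,464 frozen parameters
budget = int(0.065 * backbone)                     # what 6.5% of that backbone allows
prompts = deep_prompt_params(tokens_per_layer=10, dim=768, depth=12)

print(f"frozen backbone:  {backbone:,}")
print(f"6.5% budget:      {budget:,}")
print(f"prompt tokens:    {prompts:,} ({trainable_ratio(prompts, backbone):.2%})")
```

Under these assumed sizes, the budget comes to a few million trainable parameters against roughly 85 million frozen ones. The abstract does not break the 6.5% down by component, so this comparison says nothing about how DMPT actually allocates it.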

