CV · AI · Apr 1, 2025

IDMR: Towards Instance-Driven Precise Visual Correspondence in Multimodal Retrieval

Bangwei Liu, Yicheng Bao, Shaohui Lin, Xuhong Wang, Xin Tan, Yingchun Wang, Yuan Xie, Chaochao Lu

arXiv:2504.00954v1 · 17.49 citations · h-index: 46 · Has Code · 2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)

Originality: Incremental advance

AI Analysis

This work addresses the limited complexity and practical value of current multimodal retrieval tasks, a gap that matters for fields such as embodied AI and digital content industries. It is incremental, however, because it builds on existing MLLM methods.

The paper tackles fine-grained, instance-level visual correspondence in multimodal retrieval by introducing the Instance-Driven Multimodal Image Retrieval (IDMR) task, which requires retrieving images that contain the same instance as a query image while matching a text-described scenario. The authors' MLLM-based model outperforms state-of-the-art approaches on both traditional benchmarks and their new IDMR-bench.
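To make the task concrete, here is a minimal sketch of how an instance-driven retrieval query could be scored. It assumes the query image and scenario text have already been encoded into a single joint embedding and that each candidate image has its own embedding; the vectors, candidate names, and the `rank` helper are made up for illustration and are not taken from the paper or its code.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query: List[float], candidates: Dict[str, List[float]]) -> List[str]:
    """Return candidate names ordered from most to least similar to the query."""
    return sorted(candidates, key=lambda name: cosine(query, candidates[name]), reverse=True)


# Hypothetical joint embedding of (query image, scenario text).
query_embedding = [0.9, 0.1, 0.4]

# Hypothetical candidate image embeddings (toy values, not real model outputs).
candidate_embeddings = {
    "same_instance_right_scene": [0.8, 0.2, 0.5],
    "same_instance_wrong_scene": [0.9, 0.1, -0.4],
    "other_instance_right_scene": [-0.2, 0.9, 0.4],
}

ranking = rank(query_embedding, candidate_embeddings)
print(ranking)
```

In this toy setup the candidate that matches both the instance and the scenario ranks first, which is the behavior the IDMR task asks a model to exhibit.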

Multimodal retrieval systems are becoming increasingly vital for cutting-edge AI technologies, such as embodied AI and AI-driven digital content industries. However, current multimodal retrieval tasks lack sufficient complexity and demonstrate limited practical application value. This inspires us to design Instance-Driven Multimodal Image Retrieval (IDMR), a novel task that requires models to retrieve images containing the same instance as a query image while matching a text-described scenario. Unlike existing retrieval tasks focused on global image similarity or category-level matching, IDMR demands fine-grained instance-level consistency across diverse contexts. To benchmark this capability, we develop IDMR-bench using real-world object tracking and first-person video data. Addressing the scarcity of training data, we propose a cross-domain synthesis method that creates 557K training samples by cropping objects from standard detection datasets. Our Multimodal Large Language Model (MLLM) based retrieval model, trained on 1.2M samples, outperforms state-of-the-art approaches on both traditional benchmarks and our zero-shot IDMR-bench. Experimental results demonstrate previous models' limitations in instance-aware retrieval and highlight the potential of MLLM for advanced retrieval applications. The full training dataset, code, and models in a wide range of sizes are available at https://github.com/BwLiu01/IDMR.
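The cropping step behind the synthesized training data can be illustrated with a small sketch. This is not the authors' pipeline: it assumes COCO-style `(x, y, width, height)` boxes, represents an image as a toy grid of integers, and uses hypothetical names (`Detection`, `TrainingSample`, `synthesize`) and a hand-written scenario text purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

Image = List[List[int]]          # toy grayscale image: rows of pixel values
BBox = Tuple[int, int, int, int]  # (x, y, width, height); COCO-style is an assumption


@dataclass
class Detection:
    image_id: str
    bbox: BBox
    category: str


@dataclass
class TrainingSample:
    query_crop: Image       # cropped instance, used as the query image
    query_text: str         # hypothetical scenario description
    target_image_id: str    # image showing the instance in its full context


def crop(image: Image, bbox: BBox) -> Image:
    """Return the pixels inside bbox, clamped to the image bounds."""
    x, y, w, h = bbox
    height, width = len(image), len(image[0])
    x0, y0 = max(0, x), max(0, y)
    x1, y1 = min(width, x + w), min(height, y + h)
    return [row[x0:x1] for row in image[y0:y1]]


def synthesize(image: Image, det: Detection, scene_text: str) -> TrainingSample:
    """Pair an object crop and a scenario text with the image it came from."""
    return TrainingSample(crop(image, det.bbox), scene_text, det.image_id)


# Toy 4x6 image whose pixel value encodes its position (10 * row + column).
image = [[10 * r + c for c in range(6)] for r in range(4)]
det = Detection("img_0001", (2, 1, 3, 2), "cup")
sample = synthesize(image, det, "a cup on a kitchen table")
print(sample.query_crop)  # [[12, 13, 14], [22, 23, 24]]
```

The crop serves as the query image and the source image as the retrieval target, so a model trained on such pairs has to recognize the same instance once it is placed back into a scene.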

