CL | CV | IR | Feb 18, 2025

Towards Text-Image Interleaved Retrieval

arXiv:2502.12799v1 | 1 citation | h-index: 14 | ACL
Originality: Incremental advance
AI Analysis

The work addresses a practical limitation of multimodal retrieval in real-world settings such as tutorials, though it is incremental in that it adapts existing techniques.

The paper tackles the problem of retrieving documents containing interleaved text and images, which existing multimodal retrieval models struggle with, and introduces a new benchmark and method that achieves significant improvements over baseline approaches.

Current multimodal information retrieval studies mainly focus on single-image inputs, which limits real-world applications involving multiple images and text-image interleaved content. In this work, we introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences, and the model is required to understand the semantics of the interleaved context for effective retrieval. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. To explore the task, we adapt several off-the-shelf retrievers and build a dense baseline using an interleaved multimodal large language model (MLLM). We then propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities, to address the challenge of excessive visual tokens in MLLM-based TIIR models. Experiments demonstrate that simple adaptation of existing models does not consistently yield effective results. Our MME achieves significant improvements over the baseline while using substantially fewer visual tokens. We provide extensive analysis and will release the dataset and code to facilitate future research.
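The idea of compressing visual tokens at several granularities can be illustrated with a small sketch. The abstract does not state which compression operator MME uses, so the average pooling below is an assumption chosen only for illustration, and the names `pool_visual_tokens` and `matryoshka_token_sets` are hypothetical rather than taken from the paper's code. The sketch takes a square grid of token vectors for one image and produces nested, progressively smaller token sets.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def pool_visual_tokens(grid: List[List[Vector]], kernel: int) -> List[Vector]:
    """Average-pool a square grid of token vectors with a kernel x kernel window.

    Assumption: average pooling stands in for whatever compression MME uses.
    """
    side = len(grid)
    if kernel <= 0 or side % kernel != 0:
        raise ValueError("kernel must be positive and divide the grid side")
    if any(len(row) != side for row in grid):
        raise ValueError("grid must be square")
    dim = len(grid[0][0])
    pooled: List[Vector] = []
    for r in range(0, side, kernel):
        for c in range(0, side, kernel):
            acc = [0.0] * dim
            # Sum every token vector inside the current window.
            for dr in range(kernel):
                for dc in range(kernel):
                    tok = grid[r + dr][c + dc]
                    for i in range(dim):
                        acc[i] += tok[i]
            pooled.append([v / (kernel * kernel) for v in acc])
    return pooled


def matryoshka_token_sets(
    grid: List[List[Vector]], kernels: Sequence[int] = (1, 2, 4)
) -> Dict[int, List[Vector]]:
    """Return one pooled token list per kernel size (coarser as kernel grows)."""
    return {k: pool_visual_tokens(grid, k) for k in kernels}


# Toy 4x4 grid of 2-dimensional tokens; token (r, c) holds the vector [r, c].
grid = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
sets = matryoshka_token_sets(grid)
token_counts = {k: len(v) for k, v in sets.items()}
print(token_counts)
```

In this toy setup a 16-token image shrinks to 4 tokens and then to 1 token, which shows the trade-off the abstract points to: fewer visual tokens per image leave more of the model's context for the remaining text and images in an interleaved sequence.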

Code Implementations: 1 repo
