IR · AI · Aug 23, 2025

Zero-shot Multimodal Document Retrieval via Cross-modal Question Generation

arXiv:2508.17079v1 · 3 citations · h-index: 2 · EMNLP
Originality: Incremental advance
AI Analysis

This paper addresses multimodal document retrieval over private or siloed data, a setting common in real-world applications, and offers incremental improvements over existing methods.

The paper tackles the problem of retrieving multimodal documents in unseen domains or languages by introducing PREMIR, a framework that uses cross-modal question generation before retrieval, achieving state-of-the-art performance on out-of-distribution benchmarks with improvements across all retrieval metrics.

Rapid advances in Multimodal Large Language Models (MLLMs) have expanded information retrieval beyond purely textual inputs, enabling retrieval from complex real-world documents that combine text and visuals. However, most documents are private, either owned by individuals or confined within corporate silos, and current retrievers struggle when faced with unseen domains or languages. To address this gap, we introduce PREMIR, a simple yet effective framework that leverages the broad knowledge of an MLLM to generate cross-modal pre-questions (preQs) before retrieval. Unlike earlier multimodal retrievers that compare embeddings in a single vector space, PREMIR leverages preQs from multiple complementary modalities to expand the scope of matching to the token level. Experiments show that PREMIR achieves state-of-the-art performance on out-of-distribution benchmarks, including closed-domain and multilingual settings, outperforming strong baselines across all retrieval metrics. We confirm the contribution of each component through in-depth ablation studies, and qualitative analyses of the generated preQs further highlight the model's robustness in real-world settings.
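The retrieval idea in the abstract (generate preQs for each document ahead of time, then match the user query against those preQs rather than against the document directly) can be illustrated with a toy sketch. This is not the authors' implementation: the preQs below are hand-written stand-ins for MLLM output, the names `tokenize`, `retrieve`, and `PREQS` are hypothetical, and plain Jaccard token overlap is swapped in for whatever matching model PREMIR actually uses.

```python
def tokenize(text):
    """Lowercase and split on whitespace, dropping question marks."""
    return set(text.lower().replace("?", " ").split())


def retrieve(query, preqs_by_doc, top_k=2):
    """Rank documents by their best-matching pre-question.

    Assumption: similarity is Jaccard overlap between token sets,
    a simple stand-in for a learned retriever.
    """
    q = tokenize(query)
    scores = {}
    for doc_id, preqs in preqs_by_doc.items():
        best = 0.0
        for pq in preqs:
            p = tokenize(pq)
            union = q | p
            sim = len(q & p) / len(union) if union else 0.0
            best = max(best, sim)
        # A document is scored by its single closest preQ.
        scores[doc_id] = best
    # Sort by descending score, breaking ties by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_k]


# Hypothetical preQs; in the paper's setting an MLLM would generate
# these from each document's text and visual content before retrieval.
PREQS = {
    "invoice_page": [
        "What is the total amount due on the invoice?",
        "Who issued the invoice?",
    ],
    "chart_page": [
        "Which quarter shows the highest revenue in the bar chart?",
        "What does the y-axis measure?",
    ],
}

top = retrieve("Which quarter had the highest revenue?", PREQS, top_k=1)
```

Because the query is compared with questions rather than with raw page content, both sides of the match are short natural-language strings, which is what makes token-level matching practical in this sketch.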

