IR | AI | CV | May 10, 2025

OMGM: Orchestrate Multiple Granularities and Modalities for Efficient Multimodal Retrieval

arXiv:2505.07879v3 | 12 citations | h-index: 6 | ACL
Originality: Incremental advance
AI Analysis

This work addresses the problem of efficient multimodal retrieval for vision-language RAG systems in KB-VQA, offering an incremental improvement by better integrating modalities and granularities.

The paper tackles multimodal retrieval in vision-language retrieval-augmented generation for Knowledge-Based Visual Question Answering. It proposes a system that orchestrates multiple granularities and modalities through coarse-to-fine retrieval, and reports state-of-the-art retrieval performance and competitive answering results on the InfoSeek and Encyclopedic-VQA benchmarks.

Vision-language retrieval-augmented generation (RAG) has become an effective approach for tackling Knowledge-Based Visual Question Answering (KB-VQA), which requires external knowledge beyond the visual content presented in images. The effectiveness of vision-language RAG systems hinges on multimodal retrieval, which is inherently challenging due to the diverse modalities and knowledge granularities in both queries and knowledge bases. Existing methods have not fully tapped into the potential interplay between these elements. We propose a multimodal RAG system featuring a coarse-to-fine, multi-step retrieval that harmonizes multiple granularities and modalities to enhance efficacy. Our system begins with a broad initial search aligning knowledge granularity for cross-modal retrieval, followed by a multimodal fusion reranking to capture the nuanced multimodal information for top entity selection. A text reranker then selects the most relevant fine-grained section for augmented generation. Extensive experiments on the InfoSeek and Encyclopedic-VQA benchmarks show our method achieves state-of-the-art retrieval performance and highly competitive answering results, underscoring its effectiveness in advancing KB-VQA systems.
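
The three retrieval stages described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: plain cosine similarity over toy vectors stands in for the learned retriever and rerankers, and every name here (`Entity`, `Section`, `retrieve`, the candidate count `k`, the fusion weight `alpha`) is hypothetical. Which embeddings each stage compares is also an assumption made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Section:
    title: str
    text_vec: list[float]  # toy embedding of the section text


@dataclass
class Entity:
    name: str
    image_vec: list[float]    # toy embedding of the entity image
    summary_vec: list[float]  # toy embedding of the entity summary
    sections: list[Section]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query_image_vec, question_vec, kb, k=2, alpha=0.5):
    # Stage 1 (coarse): broad cross-modal search at entity granularity.
    # Assumption: the query image is scored against entity summaries.
    candidates = sorted(
        kb, key=lambda e: cosine(query_image_vec, e.summary_vec), reverse=True
    )[:k]

    # Stage 2: fusion reranking over the shortlist. A weighted sum of an
    # image score and a question score stands in for a learned reranker.
    def fused(e):
        return (alpha * cosine(query_image_vec, e.image_vec)
                + (1 - alpha) * cosine(question_vec, e.summary_vec))

    top_entity = max(candidates, key=fused)

    # Stage 3 (fine): text reranking picks one section of the top entity.
    best_section = max(
        top_entity.sections, key=lambda s: cosine(question_vec, s.text_vec)
    )
    return top_entity, best_section


# Toy knowledge base with made-up 3-dimensional embeddings.
kb = [
    Entity("tower", [1, 0, 0], [0.9, 0.1, 0],
           [Section("History", [0, 1, 0]), Section("Height", [0, 0, 1])]),
    Entity("bridge", [0, 1, 0], [0.6, 0.8, 0],
           [Section("Construction", [0, 1, 0])]),
    Entity("lake", [0, 0, 1], [0, 0.1, 0.9],
           [Section("Ecology", [0, 0, 1])]),
]

entity, section = retrieve([1, 0.1, 0], [0.1, 0, 1], kb)
print(entity.name, section.title)  # tower Height
```

Only the selected section would then be passed to the generator. Narrowing from entities to a single section is what keeps the final, more expensive text reranking limited to a small amount of text.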

